To submit homework, please upload both Rmd and html files to Bruinlearn by the deadline.

Q1. CFR of COVID-19

Of primary interest to public is the risk of dying from COVID-19. A commonly used measure is case fatality rate/ratio/risk (CFR), which is defined as \[ \frac{\text{number of deaths from disease}}{\text{number of diagnosed cases of disease}}. \] Apparently CFR is not a fixed constant; it changes with time, location, and other factors. Also CFR is different from the infection fatality rate (IFR), the probability that someone infected with COVID-19 dies from it.

In this exercise, we use logistic regression to study how US county-level CFR changes according to demographic information and some health-, education-, and economy-indicators.

Data sources

Sample code for data preparation

Load the tidyverse package for data manipulation and visualization.

# tidyverse of data manipulation and visualization

library(faraway)
library(gtsummary)
library(tidymodels)
library(tidyverse)
library(knitr)
library(survival)

Read in the data of COVID-19 cases reported on 2020-04-04.

county_count <- read_csv("./datasets/04-04-2020.csv.gz") %>%
  # cast fips into dbl for use as a key for joining tables
  mutate(FIPS = as.numeric(FIPS)) %>%
  filter(Country_Region == "US") %>%
  print(width = Inf)
## # A tibble: 2,421 × 12
##     FIPS Admin2    Province_State Country_Region Last_Update           Lat
##    <dbl> <chr>     <chr>          <chr>          <dttm>              <dbl>
##  1 45001 Abbeville South Carolina US             2020-04-04 23:34:21  34.2
##  2 22001 Acadia    Louisiana      US             2020-04-04 23:34:21  30.3
##  3 51001 Accomack  Virginia       US             2020-04-04 23:34:21  37.8
##  4 16001 Ada       Idaho          US             2020-04-04 23:34:21  43.5
##  5 19001 Adair     Iowa           US             2020-04-04 23:34:21  41.3
##  6 21001 Adair     Kentucky       US             2020-04-04 23:34:21  37.1
##  7 29001 Adair     Missouri       US             2020-04-04 23:34:21  40.2
##  8 40001 Adair     Oklahoma       US             2020-04-04 23:34:21  35.9
##  9  8001 Adams     Colorado       US             2020-04-04 23:34:21  39.9
## 10 16003 Adams     Idaho          US             2020-04-04 23:34:21  44.9
##     Long_ Confirmed Deaths Recovered Active Combined_Key                 
##     <dbl>     <dbl>  <dbl>     <dbl>  <dbl> <chr>                        
##  1  -82.5         6      0         0      0 Abbeville, South Carolina, US
##  2  -92.4        65      2         0      0 Acadia, Louisiana, US        
##  3  -75.6         8      0         0      0 Accomack, Virginia, US       
##  4 -116.        360      3         0      0 Ada, Idaho, US               
##  5  -94.5         1      0         0      0 Adair, Iowa, US              
##  6  -85.3         3      0         0      0 Adair, Kentucky, US          
##  7  -92.6        10      0         0      0 Adair, Missouri, US          
##  8  -94.7        14      0         0      0 Adair, Oklahoma, US          
##  9 -104.        294      9         0      0 Adams, Colorado, US          
## 10 -116.          1      0         0      0 Adams, Idaho, US             
## # ℹ 2,411 more rows

Standardize the variable names by changing them to lower case.

names(county_count) <- str_to_lower(names(county_count))
county_count
## # A tibble: 2,421 × 12
##     fips admin2   province_state country_region last_update           lat  long_
##    <dbl> <chr>    <chr>          <chr>          <dttm>              <dbl>  <dbl>
##  1 45001 Abbevil… South Carolina US             2020-04-04 23:34:21  34.2  -82.5
##  2 22001 Acadia   Louisiana      US             2020-04-04 23:34:21  30.3  -92.4
##  3 51001 Accomack Virginia       US             2020-04-04 23:34:21  37.8  -75.6
##  4 16001 Ada      Idaho          US             2020-04-04 23:34:21  43.5 -116. 
##  5 19001 Adair    Iowa           US             2020-04-04 23:34:21  41.3  -94.5
##  6 21001 Adair    Kentucky       US             2020-04-04 23:34:21  37.1  -85.3
##  7 29001 Adair    Missouri       US             2020-04-04 23:34:21  40.2  -92.6
##  8 40001 Adair    Oklahoma       US             2020-04-04 23:34:21  35.9  -94.7
##  9  8001 Adams    Colorado       US             2020-04-04 23:34:21  39.9 -104. 
## 10 16003 Adams    Idaho          US             2020-04-04 23:34:21  44.9 -116. 
## # ℹ 2,411 more rows
## # ℹ 5 more variables: confirmed <dbl>, deaths <dbl>, recovered <dbl>,
## #   active <dbl>, combined_key <chr>

Sanity check by displaying the unique US states and territories:

county_count %>%
  dplyr::select(province_state) %>%
  distinct() %>%
  arrange(province_state) %>%
  print(n = Inf)
## # A tibble: 58 × 1
##    province_state          
##    <chr>                   
##  1 Alabama                 
##  2 Alaska                  
##  3 Arizona                 
##  4 Arkansas                
##  5 California              
##  6 Colorado                
##  7 Connecticut             
##  8 Delaware                
##  9 Diamond Princess        
## 10 District of Columbia    
## 11 Florida                 
## 12 Georgia                 
## 13 Grand Princess          
## 14 Guam                    
## 15 Hawaii                  
## 16 Idaho                   
## 17 Illinois                
## 18 Indiana                 
## 19 Iowa                    
## 20 Kansas                  
## 21 Kentucky                
## 22 Louisiana               
## 23 Maine                   
## 24 Maryland                
## 25 Massachusetts           
## 26 Michigan                
## 27 Minnesota               
## 28 Mississippi             
## 29 Missouri                
## 30 Montana                 
## 31 Nebraska                
## 32 Nevada                  
## 33 New Hampshire           
## 34 New Jersey              
## 35 New Mexico              
## 36 New York                
## 37 North Carolina          
## 38 North Dakota            
## 39 Northern Mariana Islands
## 40 Ohio                    
## 41 Oklahoma                
## 42 Oregon                  
## 43 Pennsylvania            
## 44 Puerto Rico             
## 45 Recovered               
## 46 Rhode Island            
## 47 South Carolina          
## 48 South Dakota            
## 49 Tennessee               
## 50 Texas                   
## 51 Utah                    
## 52 Vermont                 
## 53 Virgin Islands          
## 54 Virginia                
## 55 Washington              
## 56 West Virginia           
## 57 Wisconsin               
## 58 Wyoming

We want to exclude entries from Diamond Princess, Grand Princess, Guam, Northern Mariana Islands, Puerto Rico, Recovered, and Virgin Islands, and only consider counties from 50 states and DC.

county_count <- county_count %>%
  filter(!(province_state %in% c("Diamond Princess", "Grand Princess", 
                                 "Recovered", "Guam", "Northern Mariana Islands", 
                                 "Puerto Rico", "Virgin Islands"))) %>%
  print(width = Inf)
## # A tibble: 2,413 × 12
##     fips admin2    province_state country_region last_update           lat
##    <dbl> <chr>     <chr>          <chr>          <dttm>              <dbl>
##  1 45001 Abbeville South Carolina US             2020-04-04 23:34:21  34.2
##  2 22001 Acadia    Louisiana      US             2020-04-04 23:34:21  30.3
##  3 51001 Accomack  Virginia       US             2020-04-04 23:34:21  37.8
##  4 16001 Ada       Idaho          US             2020-04-04 23:34:21  43.5
##  5 19001 Adair     Iowa           US             2020-04-04 23:34:21  41.3
##  6 21001 Adair     Kentucky       US             2020-04-04 23:34:21  37.1
##  7 29001 Adair     Missouri       US             2020-04-04 23:34:21  40.2
##  8 40001 Adair     Oklahoma       US             2020-04-04 23:34:21  35.9
##  9  8001 Adams     Colorado       US             2020-04-04 23:34:21  39.9
## 10 16003 Adams     Idaho          US             2020-04-04 23:34:21  44.9
##     long_ confirmed deaths recovered active combined_key                 
##     <dbl>     <dbl>  <dbl>     <dbl>  <dbl> <chr>                        
##  1  -82.5         6      0         0      0 Abbeville, South Carolina, US
##  2  -92.4        65      2         0      0 Acadia, Louisiana, US        
##  3  -75.6         8      0         0      0 Accomack, Virginia, US       
##  4 -116.        360      3         0      0 Ada, Idaho, US               
##  5  -94.5         1      0         0      0 Adair, Iowa, US              
##  6  -85.3         3      0         0      0 Adair, Kentucky, US          
##  7  -92.6        10      0         0      0 Adair, Missouri, US          
##  8  -94.7        14      0         0      0 Adair, Oklahoma, US          
##  9 -104.        294      9         0      0 Adams, Colorado, US          
## 10 -116.          1      0         0      0 Adams, Idaho, US             
## # ℹ 2,403 more rows

Graphical summarize the COVID-19 confirmed cases and deaths on 2020-04-04 by state.

county_count %>%
  # turn into long format for easy plotting
  pivot_longer(confirmed:recovered, 
               names_to = "case", 
               values_to = "count") %>%
  group_by(province_state) %>%
  ggplot() + 
  geom_col(mapping = aes(x = province_state, y = `count`, fill = `case`)) + 
  # scale_y_log10() + 
  labs(title = "US COVID-19 Situation on 2020-04-04", x = "State") + 
  theme(axis.text.x = element_text(angle = 90))

Read in the 2020 county-level health ranking data.

county_info <- read_csv("./datasets/us-county-health-rankings-2020.csv.gz") %>%
  filter(!is.na(county)) %>%
  # cast fips into dbl for use as a key for joining tables
  mutate(fips = as.numeric(fips)) %>%
  dplyr::select(fips, 
         state,
         county,
         percent_fair_or_poor_health, 
         percent_smokers, 
         percent_adults_with_obesity, 
         # food_environment_index,
         percent_with_access_to_exercise_opportunities, 
         percent_excessive_drinking,
         # teen_birth_rate, 
         percent_uninsured,
         # primary_care_physicians_rate,
         # preventable_hospitalization_rate,
         # high_school_graduation_rate,
         percent_some_college,
         percent_unemployed,
         percent_children_in_poverty,
         # `80th_percentile_income`,
         # `20th_percentile_income`,
         percent_single_parent_households,
         # violent_crime_rate,
         percent_severe_housing_problems,
         overcrowding,
         # life_expectancy,
         # age_adjusted_death_rate,
         percent_adults_with_diabetes,
         # hiv_prevalence_rate,
         percent_food_insecure,
         # percent_limited_access_to_healthy_foods,
         percent_insufficient_sleep,
         percent_uninsured_2,
         median_household_income,
         average_traffic_volume_per_meter_of_major_roadways,
         percent_homeowners,
         # percent_severe_housing_cost_burden,
         population_2,
         percent_less_than_18_years_of_age,
         percent_65_and_over,
         percent_black,
         percent_asian,
         percent_hispanic,
         percent_female,
         percent_rural) %>%
  print(width = Inf)
## # A tibble: 3,142 × 30
##     fips state   county   percent_fair_or_poor_health percent_smokers
##    <dbl> <chr>   <chr>                          <dbl>           <dbl>
##  1  1001 Alabama Autauga                         20.9            18.1
##  2  1003 Alabama Baldwin                         17.5            17.5
##  3  1005 Alabama Barbour                         29.6            22.0
##  4  1007 Alabama Bibb                            19.4            19.1
##  5  1009 Alabama Blount                          21.7            19.2
##  6  1011 Alabama Bullock                         31.0            22.9
##  7  1013 Alabama Butler                          27.9            21.8
##  8  1015 Alabama Calhoun                         23.1            20.6
##  9  1017 Alabama Chambers                        24.0            19.4
## 10  1019 Alabama Cherokee                        20.7            17.5
##    percent_adults_with_obesity percent_with_access_to_exercise_opportunities
##                          <dbl>                                         <dbl>
##  1                        33.3                                         69.1 
##  2                        31                                           73.7 
##  3                        41.7                                         53.2 
##  4                        37.6                                         16.3 
##  5                        33.8                                         15.6 
##  6                        37.2                                          2.50
##  7                        43.3                                         48.6 
##  8                        38.5                                         47.7 
##  9                        40.1                                         61.9 
## 10                        35                                           33.4 
##    percent_excessive_drinking percent_uninsured percent_some_college
##                         <dbl>             <dbl>                <dbl>
##  1                       15.0              8.72                 62.0
##  2                       18.0             11.3                  67.4
##  3                       12.8             12.2                  34.9
##  4                       15.6             10.2                  44.1
##  5                       14.2             13.4                  53.4
##  6                       12.1             11.4                  35.0
##  7                       11.9             11.2                  41.7
##  8                       13.8             11.9                  59.2
##  9                       12.7             11.9                  48.5
## 10                       14.1             11.2                  51.8
##    percent_unemployed percent_children_in_poverty
##                 <dbl>                       <dbl>
##  1               3.63                        19.3
##  2               3.62                        13.9
##  3               5.17                        43.9
##  4               3.97                        27.8
##  5               3.51                        18  
##  6               4.69                        68.3
##  7               4.79                        36.3
##  8               4.65                        26.5
##  9               3.91                        30.7
## 10               3.57                        24.7
##    percent_single_parent_households percent_severe_housing_problems overcrowding
##                               <dbl>                           <dbl>        <dbl>
##  1                             26.2                            14.7        1.20 
##  2                             24.1                            13.6        1.27 
##  3                             56.6                            14.6        1.69 
##  4                             28.7                            10.5        0.255
##  5                             28.6                            10.5        1.89 
##  6                             74.8                            18.1        0.113
##  7                             52.7                            13.2        1.69 
##  8                             40.2                            13.7        1.54 
##  9                             46.6                            16.0        4.04 
## 10                             23.8                            13          1.5  
##    percent_adults_with_diabetes percent_food_insecure percent_insufficient_sleep
##                           <dbl>                 <dbl>                      <dbl>
##  1                         11.1                  13.2                       35.9
##  2                         10.7                  11.6                       33.3
##  3                         17.6                  22                         38.6
##  4                         14.5                  14.3                       38.1
##  5                         17                    10.7                       35.9
##  6                         23.7                  24.8                       45.0
##  7                         19.2                  20.6                       41.9
##  8                         17.5                  15.7                       41.3
##  9                         19.9                  17.9                       37.3
## 10                         15.2                  12.5                       35.4
##    percent_uninsured_2 median_household_income
##                  <dbl>                   <dbl>
##  1                11.1                   59338
##  2                14.3                   57588
##  3                16.1                   34382
##  4                13                     46064
##  5                17.1                   50412
##  6                15.2                   29267
##  7                14.5                   37365
##  8                15.4                   45400
##  9                15.2                   39917
## 10                13.9                   42132
##    average_traffic_volume_per_meter_of_major_roadways percent_homeowners
##                                                 <dbl>              <dbl>
##  1                                              88.5                74.9
##  2                                              87.0                73.6
##  3                                             102.                 61.4
##  4                                              29.3                75.1
##  5                                              33.4                78.6
##  6                                               4.07               75.5
##  7                                              19.3                69.9
##  8                                             110.                 69.5
##  9                                              20.3                67.8
## 10                                              25.9                79.0
##    population_2 percent_less_than_18_years_of_age percent_65_and_over
##           <dbl>                             <dbl>               <dbl>
##  1        55601                              23.7                15.6
##  2       218022                              21.6                20.4
##  3        24881                              20.9                19.4
##  4        22400                              20.5                16.5
##  5        57840                              23.2                18.2
##  6        10138                              21.1                16.4
##  7        19680                              22.2                20.3
##  8       114277                              21.6                17.7
##  9        33615                              20.8                19.5
## 10        26032                              19.2                23.0
##    percent_black percent_asian percent_hispanic percent_female percent_rural
##            <dbl>         <dbl>            <dbl>          <dbl>         <dbl>
##  1         19.3          1.22              2.97           51.4          42.0
##  2          8.78         1.15              4.65           51.5          42.3
##  3         48.0          0.454             4.28           47.2          67.8
##  4         21.1          0.237             2.62           46.8          68.4
##  5          1.46         0.320             9.57           50.7          90.0
##  6         69.5          0.187             7.96           45.5          51.4
##  7         44.6          1.32              1.51           53.4          71.2
##  8         20.9          0.964             3.91           51.9          33.7
##  9         39.6          1.33              2.56           52.1          49.1
## 10          4.24         0.338             1.62           50.5          85.7
## # ℹ 3,132 more rows

For stability in estimating CFR, we restrict to counties with \(\ge 5\) confirmed cases.

county_count <- county_count %>%
  filter(confirmed >= 5)

We join the COVID-19 count data and county-level information using FIPS (Federal Information Processing System) as key.

county_data <- county_count %>%
  left_join(county_info, by = "fips") %>%
  print(width = Inf)
## # A tibble: 1,466 × 41
##     fips admin2    province_state country_region last_update           lat
##    <dbl> <chr>     <chr>          <chr>          <dttm>              <dbl>
##  1 45001 Abbeville South Carolina US             2020-04-04 23:34:21  34.2
##  2 22001 Acadia    Louisiana      US             2020-04-04 23:34:21  30.3
##  3 51001 Accomack  Virginia       US             2020-04-04 23:34:21  37.8
##  4 16001 Ada       Idaho          US             2020-04-04 23:34:21  43.5
##  5 29001 Adair     Missouri       US             2020-04-04 23:34:21  40.2
##  6 40001 Adair     Oklahoma       US             2020-04-04 23:34:21  35.9
##  7  8001 Adams     Colorado       US             2020-04-04 23:34:21  39.9
##  8 28001 Adams     Mississippi    US             2020-04-04 23:34:21  31.5
##  9 31001 Adams     Nebraska       US             2020-04-04 23:34:21  40.5
## 10 42001 Adams     Pennsylvania   US             2020-04-04 23:34:21  39.9
##     long_ confirmed deaths recovered active combined_key                 
##     <dbl>     <dbl>  <dbl>     <dbl>  <dbl> <chr>                        
##  1  -82.5         6      0         0      0 Abbeville, South Carolina, US
##  2  -92.4        65      2         0      0 Acadia, Louisiana, US        
##  3  -75.6         8      0         0      0 Accomack, Virginia, US       
##  4 -116.        360      3         0      0 Ada, Idaho, US               
##  5  -92.6        10      0         0      0 Adair, Missouri, US          
##  6  -94.7        14      0         0      0 Adair, Oklahoma, US          
##  7 -104.        294      9         0      0 Adams, Colorado, US          
##  8  -91.4        16      0         0      0 Adams, Mississippi, US       
##  9  -98.5         8      0         0      0 Adams, Nebraska, US          
## 10  -77.2        21      0         0      0 Adams, Pennsylvania, US      
##    state          county    percent_fair_or_poor_health percent_smokers
##    <chr>          <chr>                           <dbl>           <dbl>
##  1 South Carolina Abbeville                        19.9            17.3
##  2 Louisiana      Acadia                           20.9            21.5
##  3 Virginia       Accomack                         20.1            18.3
##  4 Idaho          Ada                              11.5            12.0
##  5 Missouri       Adair                            21.4            20.5
##  6 Oklahoma       Adair                            28.5            27.7
##  7 Colorado       Adams                            16.6            16.3
##  8 Mississippi    Adams                            27.3            22.2
##  9 Nebraska       Adams                            15.8            14.6
## 10 Pennsylvania   Adams                            15.3            16.2
##    percent_adults_with_obesity percent_with_access_to_exercise_opportunities
##                          <dbl>                                         <dbl>
##  1                        36.7                                          59.0
##  2                        38.4                                          42.5
##  3                        36.3                                          37.4
##  4                        25.6                                          89.5
##  5                        27.9                                          78.3
##  6                        47.7                                          28.5
##  7                        27.8                                          93.1
##  8                        35.3                                          69.1
##  9                        36.7                                          81.6
## 10                        35.6                                          60.6
##    percent_excessive_drinking percent_uninsured percent_some_college
##                         <dbl>             <dbl>                <dbl>
##  1                       15.9             12.9                  52.5
##  2                       19.8             10.7                  43.6
##  3                       15.5             16.6                  45.1
##  4                       17.9              8.74                 73.8
##  5                       18.9             10.6                  65.3
##  6                       11.8             24.5                  35.1
##  7                       18.9             11.0                  57.0
##  8                       12.3             15.0                  41.7
##  9                       18.5              8.76                 70.8
## 10                       19.2              7.49                 57.3
##    percent_unemployed percent_children_in_poverty
##                 <dbl>                       <dbl>
##  1               3.98                        30.8
##  2               5.37                        35.4
##  3               3.81                        27  
##  4               2.46                        10.2
##  5               3.51                        19.9
##  6               4.17                        34.9
##  7               3.47                        12.6
##  8               6.21                        40.4
##  9               2.87                        14.4
## 10               3.27                        11.2
##    percent_single_parent_households percent_severe_housing_problems overcrowding
##                               <dbl>                           <dbl>        <dbl>
##  1                             37.1                            14.3        0.463
##  2                             33.4                            12.3        3.51 
##  3                             45.9                            15.1        2.10 
##  4                             23.8                            14.0        1.46 
##  5                             29.5                            18.0        0.740
##  6                             38.3                            15.4        5.65 
##  7                             31.0                            18.1        5.37 
##  8                             66.4                            12.8        2.37 
##  9                             26.2                            10.5        0.904
## 10                             26.7                            12.3        1.88 
##    percent_adults_with_diabetes percent_food_insecure percent_insufficient_sleep
##                           <dbl>                 <dbl>                      <dbl>
##  1                         15.8                  15.2                       36.1
##  2                         11.4                  15.1                       32.4
##  3                         15.9                  14.1                       36.8
##  4                          7.9                  12                         26.3
##  5                          8.4                  17.5                       31.9
##  6                         24.3                  19.1                       39.5
##  7                          7.7                   8                         31.0
##  8                         13.2                  24.7                       41.1
##  9                         11                    11.7                       30.1
## 10                          8.5                   8.3                       34.7
##    percent_uninsured_2 median_household_income
##                  <dbl>                   <dbl>
##  1               15.9                    42412
##  2               14.0                    40484
##  3               19.4                    42879
##  4               11.1                    66827
##  5               12.3                    40395
##  6               29.6                    35156
##  7               13.8                    70199
##  8               18.7                    33392
##  9               10.7                    55167
## 10                8.46                   62877
##    average_traffic_volume_per_meter_of_major_roadways percent_homeowners
##                                                 <dbl>              <dbl>
##  1                                               11.6               76.3
##  2                                               63.7               70.8
##  3                                               60.0               67.9
##  4                                              277.                68.4
##  5                                               45.8               60.0
##  6                                               16.7               68.6
##  7                                              490.                65.2
##  8                                              150.                61.7
##  9                                               53.4               68.2
## 10                                              113.                77.2
##    population_2 percent_less_than_18_years_of_age percent_65_and_over
##           <dbl>                             <dbl>               <dbl>
##  1        24541                              20.1                21.8
##  2        62190                              25.8                15.3
##  3        32412                              20.5                23.6
##  4       469966                              23.8                14.4
##  5        25339                              18.4                14.8
##  6        22082                              26.6                15.9
##  7       511868                              26.5                10.5
##  8        31192                              20.1                18.8
##  9        31511                              23.7                18.2
## 10       102811                              20.0                20.4
##    percent_black percent_asian percent_hispanic percent_female percent_rural
##            <dbl>         <dbl>            <dbl>          <dbl>         <dbl>
##  1        27.5           0.412             1.54           51.6         78.6 
##  2        17.9           0.320             2.73           51.2         51.7 
##  3        28.0           0.781             9.34           51.2        100   
##  4         1.24          2.81              8.31           49.9          5.47
##  5         2.85          2.28              2.57           51.9         37.9 
##  6         0.534         0.802             6.82           50.1         83.3 
##  7         3.19          4.37             40.4            49.5          3.62
##  8        52.4           0.513            11.3            47.9         37.2 
##  9         0.996         1.33             10.9            50.2         22.5 
## 10         1.60          0.875             7.11           50.8         53.7 
## # ℹ 1,456 more rows

Numerical summaries of each variable:

summary(county_data)
##       fips          admin2          province_state     country_region    
##  Min.   : 1001   Length:1466        Length:1466        Length:1466       
##  1st Qu.:18003   Class :character   Class :character   Class :character  
##  Median :29029   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :30076                                                           
##  3rd Qu.:42077                                                           
##  Max.   :90053                                                           
##  NA's   :13                                                              
##   last_update                       lat            long_        
##  Min.   :2020-04-04 23:34:21   Min.   :19.60   Min.   :-159.60  
##  1st Qu.:2020-04-04 23:34:21   1st Qu.:33.96   1st Qu.: -94.56  
##  Median :2020-04-04 23:34:21   Median :38.02   Median : -86.48  
##  Mean   :2020-04-04 23:34:21   Mean   :37.71   Mean   : -89.73  
##  3rd Qu.:2020-04-04 23:34:21   3rd Qu.:41.38   3rd Qu.: -81.22  
##  Max.   :2020-04-04 23:34:21   Max.   :64.81   Max.   : -68.65  
##                                NA's   :19      NA's   :19       
##    confirmed           deaths           recovered     active 
##  Min.   :    5.0   Min.   :   0.000   Min.   :0   Min.   :0  
##  1st Qu.:    9.0   1st Qu.:   0.000   1st Qu.:0   1st Qu.:0  
##  Median :   20.0   Median :   0.000   Median :0   Median :0  
##  Mean   :  208.8   Mean   :   4.842   Mean   :0   Mean   :0  
##  3rd Qu.:   68.0   3rd Qu.:   2.000   3rd Qu.:0   3rd Qu.:0  
##  Max.   :63306.0   Max.   :1905.000   Max.   :0   Max.   :0  
##                                                              
##  combined_key          state              county         
##  Length:1466        Length:1466        Length:1466       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##  percent_fair_or_poor_health percent_smokers  percent_adults_with_obesity
##  Min.   : 8.121              Min.   : 5.909   Min.   :12.40              
##  1st Qu.:14.390              1st Qu.:14.899   1st Qu.:29.10              
##  Median :17.010              Median :17.147   Median :32.95              
##  Mean   :17.594              Mean   :17.153   Mean   :32.41              
##  3rd Qu.:20.377              3rd Qu.:19.365   3rd Qu.:36.20              
##  Max.   :38.887              Max.   :27.775   Max.   :51.00              
##  NA's   :28                  NA's   :28       NA's   :28                 
##  percent_with_access_to_exercise_opportunities percent_excessive_drinking
##  Min.   :  0.00                                Min.   : 7.81             
##  1st Qu.: 59.95                                1st Qu.:15.70             
##  Median : 74.71                                Median :18.03             
##  Mean   : 71.14                                Mean   :17.92             
##  3rd Qu.: 85.94                                3rd Qu.:20.00             
##  Max.   :100.00                                Max.   :28.62             
##  NA's   :28                                    NA's   :28                
##  percent_uninsured percent_some_college percent_unemployed
##  Min.   : 2.263    Min.   :30.06        Min.   : 1.582    
##  1st Qu.: 6.754    1st Qu.:53.24        1st Qu.: 3.252    
##  Median : 9.925    Median :61.21        Median : 3.870    
##  Mean   :10.583    Mean   :60.87        Mean   : 4.071    
##  3rd Qu.:13.519    3rd Qu.:68.74        3rd Qu.: 4.690    
##  Max.   :31.208    Max.   :90.34        Max.   :18.092    
##  NA's   :28        NA's   :28           NA's   :28        
##  percent_children_in_poverty percent_single_parent_households
##  Min.   : 2.50               Min.   : 9.43                   
##  1st Qu.:12.82               1st Qu.:27.09                   
##  Median :18.40               Median :32.96                   
##  Mean   :19.46               Mean   :33.84                   
##  3rd Qu.:24.50               3rd Qu.:38.94                   
##  Max.   :55.00               Max.   :80.00                   
##  NA's   :28                  NA's   :28                      
##  percent_severe_housing_problems  overcrowding    percent_adults_with_diabetes
##  Min.   : 6.562                  Min.   : 0.000   Min.   : 1.800              
##  1st Qu.:12.267                  1st Qu.: 1.378   1st Qu.: 9.125              
##  Median :14.439                  Median : 1.962   Median :11.300              
##  Mean   :15.079                  Mean   : 2.429   Mean   :11.749              
##  3rd Qu.:16.976                  3rd Qu.: 2.882   3rd Qu.:13.900              
##  Max.   :33.391                  Max.   :14.489   Max.   :34.100              
##  NA's   :28                      NA's   :28       NA's   :28                  
##  percent_food_insecure percent_insufficient_sleep percent_uninsured_2
##  Min.   : 3.40         Min.   :23.03              Min.   : 2.683     
##  1st Qu.:10.70         1st Qu.:31.42              1st Qu.: 7.865     
##  Median :12.70         Median :34.02              Median :12.027     
##  Mean   :13.26         Mean   :33.88              Mean   :12.776     
##  3rd Qu.:15.20         3rd Qu.:36.56              3rd Qu.:16.541     
##  Max.   :33.50         Max.   :46.71              Max.   :42.397     
##  NA's   :28            NA's   :28                 NA's   :28         
##  median_household_income average_traffic_volume_per_meter_of_major_roadways
##  Min.   : 25385          Min.   :   0.00                                   
##  1st Qu.: 46994          1st Qu.:  53.05                                   
##  Median : 54317          Median : 105.00                                   
##  Mean   : 57584          Mean   : 201.39                                   
##  3rd Qu.: 64754          3rd Qu.: 206.92                                   
##  Max.   :140382          Max.   :4444.12                                   
##  NA's   :28              NA's   :28                                        
##  percent_homeowners  population_2      percent_less_than_18_years_of_age
##  Min.   :24.13      Min.   :    2887   Min.   : 7.069                   
##  1st Qu.:64.34      1st Qu.:   36502   1st Qu.:20.321                   
##  Median :69.98      Median :   75478   Median :22.182                   
##  Mean   :68.98      Mean   :  202450   Mean   :22.197                   
##  3rd Qu.:74.78      3rd Qu.:  180031   3rd Qu.:24.002                   
##  Max.   :89.76      Max.   :10105518   Max.   :35.447                   
##  NA's   :28         NA's   :28         NA's   :28                       
##  percent_65_and_over percent_black     percent_asian      percent_hispanic 
##  Min.   : 7.722      Min.   : 0.1286   Min.   : 0.06245   Min.   : 0.7952  
##  1st Qu.:14.927      1st Qu.: 1.6175   1st Qu.: 0.68248   1st Qu.: 2.9419  
##  Median :17.222      Median : 5.6397   Median : 1.23421   Median : 5.5939  
##  Mean   :17.516      Mean   :12.4178   Mean   : 2.40412   Mean   :10.0010  
##  3rd Qu.:19.598      3rd Qu.:17.5931   3rd Qu.: 2.67550   3rd Qu.:11.0564  
##  Max.   :57.587      Max.   :81.9544   Max.   :42.95231   Max.   :96.3595  
##  NA's   :28          NA's   :28        NA's   :28         NA's   :28       
##  percent_female  percent_rural   
##  Min.   :34.63   Min.   :  0.00  
##  1st Qu.:50.00   1st Qu.: 17.11  
##  Median :50.66   Median : 36.97  
##  Mean   :50.46   Mean   : 40.11  
##  3rd Qu.:51.35   3rd Qu.: 60.00  
##  Max.   :56.87   Max.   :100.00  
##  NA's   :28      NA's   :28

List rows in county_data that don’t have a match in county_count:

county_data %>%
  filter(is.na(state) & is.na(county)) %>%
  print(n = Inf)
## # A tibble: 28 × 41
##     fips admin2   province_state country_region last_update           lat  long_
##    <dbl> <chr>    <chr>          <chr>          <dttm>              <dbl>  <dbl>
##  1    NA DeKalb   Tennessee      US             2020-04-04 23:34:21  36.0  -85.8
##  2    NA DeSoto   Florida        US             2020-04-04 23:34:21  27.2  -81.8
##  3    NA Dukes a… Massachusetts  US             2020-04-04 23:34:21  41.4  -70.7
##  4    NA Fillmore Minnesota      US             2020-04-04 23:34:21  43.7  -92.1
##  5    NA Kansas … Missouri       US             2020-04-04 23:34:21  39.1  -94.6
##  6    NA LaSalle  Illinois       US             2020-04-04 23:34:21  41.3  -88.9
##  7    NA Manassas Virginia       US             2020-04-04 23:34:21  38.7  -77.5
##  8    NA McDuffie Georgia        US             2020-04-04 23:34:21  33.5  -82.5
##  9    NA Out of … Michigan       US             2020-04-04 23:34:21  NA     NA  
## 10    NA Out of … Tennessee      US             2020-04-04 23:34:21  NA     NA  
## 11 90005 Unassig… Arkansas       US             2020-04-04 23:34:21  NA     NA  
## 12 90008 Unassig… Colorado       US             2020-04-04 23:34:21  NA     NA  
## 13 90009 Unassig… Connecticut    US             2020-04-04 23:34:21  NA     NA  
## 14 90013 Unassig… Georgia        US             2020-04-04 23:34:21  NA     NA  
## 15 90015 Unassig… Hawaii         US             2020-04-04 23:34:21  NA     NA  
## 16 90017 Unassig… Illinois       US             2020-04-04 23:34:21  NA     NA  
## 17 90021 Unassig… Kentucky       US             2020-04-04 23:34:21  NA     NA  
## 18    NA Unassig… Louisiana      US             2020-04-04 23:34:21  NA     NA  
## 19 90023 Unassig… Maine          US             2020-04-04 23:34:21  NA     NA  
## 20 90025 Unassig… Massachusetts  US             2020-04-04 23:34:21  NA     NA  
## 21    NA Unassig… Michigan       US             2020-04-04 23:34:21  NA     NA  
## 22 90032 Unassig… Nevada         US             2020-04-04 23:34:21  NA     NA  
## 23 90034 Unassig… New Jersey     US             2020-04-04 23:34:21  NA     NA  
## 24 90044 Unassig… Rhode Island   US             2020-04-04 23:34:21  NA     NA  
## 25 90047 Unassig… Tennessee      US             2020-04-04 23:34:21  NA     NA  
## 26 90050 Unassig… Vermont        US             2020-04-04 23:34:21  NA     NA  
## 27 90053 Unassig… Washington     US             2020-04-04 23:34:21  NA     NA  
## 28    NA Weber    Utah           US             2020-04-04 23:34:21  41.3 -112. 
## # ℹ 34 more variables: confirmed <dbl>, deaths <dbl>, recovered <dbl>,
## #   active <dbl>, combined_key <chr>, state <chr>, county <chr>,
## #   percent_fair_or_poor_health <dbl>, percent_smokers <dbl>,
## #   percent_adults_with_obesity <dbl>,
## #   percent_with_access_to_exercise_opportunities <dbl>,
## #   percent_excessive_drinking <dbl>, percent_uninsured <dbl>,
## #   percent_some_college <dbl>, percent_unemployed <dbl>, …

We found there are some rows that miss fips.

county_count %>%
  filter(is.na(fips)) %>%
  dplyr::select(fips, admin2, province_state) %>%
  print(n = Inf)
## # A tibble: 13 × 3
##     fips admin2              province_state
##    <dbl> <chr>               <chr>         
##  1    NA DeKalb              Tennessee     
##  2    NA DeSoto              Florida       
##  3    NA Dukes and Nantucket Massachusetts 
##  4    NA Fillmore            Minnesota     
##  5    NA Kansas City         Missouri      
##  6    NA LaSalle             Illinois      
##  7    NA Manassas            Virginia      
##  8    NA McDuffie            Georgia       
##  9    NA Out of MI           Michigan      
## 10    NA Out of TN           Tennessee     
## 11    NA Unassigned          Louisiana     
## 12    NA Unassigned          Michigan      
## 13    NA Weber               Utah

We need to (1) manually set the fips for some counties, (2) discard those Unassigned, unassigned or Out of, and (3) try to join with county_info again.

county_data <- county_count %>%
  # manually set FIPS for some counties
  mutate(fips = ifelse(admin2 == "DeKalb" & province_state == "Tennessee", 47041, fips)) %>%
  mutate(fips = ifelse(admin2 == "DeSoto" & province_state == "Florida", 12027, fips)) %>%
  #mutate(fips = ifelse(admin2 == "Dona Ana" & province_state == "New Mexico", 35013, fips)) %>% 
  mutate(fips = ifelse(admin2 == "Dukes and Nantucket" & province_state == "Massachusetts", 25019, fips)) %>% 
  mutate(fips = ifelse(admin2 == "Fillmore" & province_state == "Minnesota", 27045, fips)) %>%  
  #mutate(fips = ifelse(admin2 == "Harris" & province_state == "Texas", 48201, fips)) %>%  
  #mutate(fips = ifelse(admin2 == "Kenai Peninsula" & province_state == "Alaska", 2122, fips)) %>%  
  mutate(fips = ifelse(admin2 == "LaSalle" & province_state == "Illinois", 17099, fips)) %>%
  #mutate(fips = ifelse(admin2 == "LaSalle" & province_state == "Louisiana", 22059, fips)) %>%
  #mutate(fips = ifelse(admin2 == "Lac qui Parle" & province_state == "Minnesota", 27073, fips)) %>%  
  mutate(fips = ifelse(admin2 == "Manassas" & province_state == "Virginia", 51683, fips)) %>%
  #mutate(fips = ifelse(admin2 == "Matanuska-Susitna" & province_state == "Alaska", 2170, fips)) %>%
  mutate(fips = ifelse(admin2 == "McDuffie" & province_state == "Georgia", 13189, fips)) %>%
  #mutate(fips = ifelse(admin2 == "McIntosh" & province_state == "Georgia", 13191, fips)) %>%
  #mutate(fips = ifelse(admin2 == "McKean" & province_state == "Pennsylvania", 42083, fips)) %>%
  mutate(fips = ifelse(admin2 == "Weber" & province_state == "Utah", 49057, fips)) %>%
  filter(!(is.na(fips) | str_detect(admin2, "Out of") | str_detect(admin2, "Unassigned"))) %>%
  left_join(county_info, by = "fips") %>%
  print(width = Inf)
## # A tibble: 1,446 × 41
##     fips admin2    province_state country_region last_update           lat
##    <dbl> <chr>     <chr>          <chr>          <dttm>              <dbl>
##  1 45001 Abbeville South Carolina US             2020-04-04 23:34:21  34.2
##  2 22001 Acadia    Louisiana      US             2020-04-04 23:34:21  30.3
##  3 51001 Accomack  Virginia       US             2020-04-04 23:34:21  37.8
##  4 16001 Ada       Idaho          US             2020-04-04 23:34:21  43.5
##  5 29001 Adair     Missouri       US             2020-04-04 23:34:21  40.2
##  6 40001 Adair     Oklahoma       US             2020-04-04 23:34:21  35.9
##  7  8001 Adams     Colorado       US             2020-04-04 23:34:21  39.9
##  8 28001 Adams     Mississippi    US             2020-04-04 23:34:21  31.5
##  9 31001 Adams     Nebraska       US             2020-04-04 23:34:21  40.5
## 10 42001 Adams     Pennsylvania   US             2020-04-04 23:34:21  39.9
##     long_ confirmed deaths recovered active combined_key                 
##     <dbl>     <dbl>  <dbl>     <dbl>  <dbl> <chr>                        
##  1  -82.5         6      0         0      0 Abbeville, South Carolina, US
##  2  -92.4        65      2         0      0 Acadia, Louisiana, US        
##  3  -75.6         8      0         0      0 Accomack, Virginia, US       
##  4 -116.        360      3         0      0 Ada, Idaho, US               
##  5  -92.6        10      0         0      0 Adair, Missouri, US          
##  6  -94.7        14      0         0      0 Adair, Oklahoma, US          
##  7 -104.        294      9         0      0 Adams, Colorado, US          
##  8  -91.4        16      0         0      0 Adams, Mississippi, US       
##  9  -98.5         8      0         0      0 Adams, Nebraska, US          
## 10  -77.2        21      0         0      0 Adams, Pennsylvania, US      
##    state          county    percent_fair_or_poor_health percent_smokers
##    <chr>          <chr>                           <dbl>           <dbl>
##  1 South Carolina Abbeville                        19.9            17.3
##  2 Louisiana      Acadia                           20.9            21.5
##  3 Virginia       Accomack                         20.1            18.3
##  4 Idaho          Ada                              11.5            12.0
##  5 Missouri       Adair                            21.4            20.5
##  6 Oklahoma       Adair                            28.5            27.7
##  7 Colorado       Adams                            16.6            16.3
##  8 Mississippi    Adams                            27.3            22.2
##  9 Nebraska       Adams                            15.8            14.6
## 10 Pennsylvania   Adams                            15.3            16.2
##    percent_adults_with_obesity percent_with_access_to_exercise_opportunities
##                          <dbl>                                         <dbl>
##  1                        36.7                                          59.0
##  2                        38.4                                          42.5
##  3                        36.3                                          37.4
##  4                        25.6                                          89.5
##  5                        27.9                                          78.3
##  6                        47.7                                          28.5
##  7                        27.8                                          93.1
##  8                        35.3                                          69.1
##  9                        36.7                                          81.6
## 10                        35.6                                          60.6
##    percent_excessive_drinking percent_uninsured percent_some_college
##                         <dbl>             <dbl>                <dbl>
##  1                       15.9             12.9                  52.5
##  2                       19.8             10.7                  43.6
##  3                       15.5             16.6                  45.1
##  4                       17.9              8.74                 73.8
##  5                       18.9             10.6                  65.3
##  6                       11.8             24.5                  35.1
##  7                       18.9             11.0                  57.0
##  8                       12.3             15.0                  41.7
##  9                       18.5              8.76                 70.8
## 10                       19.2              7.49                 57.3
##    percent_unemployed percent_children_in_poverty
##                 <dbl>                       <dbl>
##  1               3.98                        30.8
##  2               5.37                        35.4
##  3               3.81                        27  
##  4               2.46                        10.2
##  5               3.51                        19.9
##  6               4.17                        34.9
##  7               3.47                        12.6
##  8               6.21                        40.4
##  9               2.87                        14.4
## 10               3.27                        11.2
##    percent_single_parent_households percent_severe_housing_problems overcrowding
##                               <dbl>                           <dbl>        <dbl>
##  1                             37.1                            14.3        0.463
##  2                             33.4                            12.3        3.51 
##  3                             45.9                            15.1        2.10 
##  4                             23.8                            14.0        1.46 
##  5                             29.5                            18.0        0.740
##  6                             38.3                            15.4        5.65 
##  7                             31.0                            18.1        5.37 
##  8                             66.4                            12.8        2.37 
##  9                             26.2                            10.5        0.904
## 10                             26.7                            12.3        1.88 
##    percent_adults_with_diabetes percent_food_insecure percent_insufficient_sleep
##                           <dbl>                 <dbl>                      <dbl>
##  1                         15.8                  15.2                       36.1
##  2                         11.4                  15.1                       32.4
##  3                         15.9                  14.1                       36.8
##  4                          7.9                  12                         26.3
##  5                          8.4                  17.5                       31.9
##  6                         24.3                  19.1                       39.5
##  7                          7.7                   8                         31.0
##  8                         13.2                  24.7                       41.1
##  9                         11                    11.7                       30.1
## 10                          8.5                   8.3                       34.7
##    percent_uninsured_2 median_household_income
##                  <dbl>                   <dbl>
##  1               15.9                    42412
##  2               14.0                    40484
##  3               19.4                    42879
##  4               11.1                    66827
##  5               12.3                    40395
##  6               29.6                    35156
##  7               13.8                    70199
##  8               18.7                    33392
##  9               10.7                    55167
## 10                8.46                   62877
##    average_traffic_volume_per_meter_of_major_roadways percent_homeowners
##                                                 <dbl>              <dbl>
##  1                                               11.6               76.3
##  2                                               63.7               70.8
##  3                                               60.0               67.9
##  4                                              277.                68.4
##  5                                               45.8               60.0
##  6                                               16.7               68.6
##  7                                              490.                65.2
##  8                                              150.                61.7
##  9                                               53.4               68.2
## 10                                              113.                77.2
##    population_2 percent_less_than_18_years_of_age percent_65_and_over
##           <dbl>                             <dbl>               <dbl>
##  1        24541                              20.1                21.8
##  2        62190                              25.8                15.3
##  3        32412                              20.5                23.6
##  4       469966                              23.8                14.4
##  5        25339                              18.4                14.8
##  6        22082                              26.6                15.9
##  7       511868                              26.5                10.5
##  8        31192                              20.1                18.8
##  9        31511                              23.7                18.2
## 10       102811                              20.0                20.4
##    percent_black percent_asian percent_hispanic percent_female percent_rural
##            <dbl>         <dbl>            <dbl>          <dbl>         <dbl>
##  1        27.5           0.412             1.54           51.6         78.6 
##  2        17.9           0.320             2.73           51.2         51.7 
##  3        28.0           0.781             9.34           51.2        100   
##  4         1.24          2.81              8.31           49.9          5.47
##  5         2.85          2.28              2.57           51.9         37.9 
##  6         0.534         0.802             6.82           50.1         83.3 
##  7         3.19          4.37             40.4            49.5          3.62
##  8        52.4           0.513            11.3            47.9         37.2 
##  9         0.996         1.33             10.9            50.2         22.5 
## 10         1.60          0.875             7.11           50.8         53.7 
## # ℹ 1,436 more rows

Summarize again

summary(county_data)
##       fips          admin2          province_state     country_region    
##  Min.   : 1001   Length:1446        Length:1446        Length:1446       
##  1st Qu.:17186   Class :character   Class :character   Class :character  
##  Median :28156   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :29455                                                           
##  3rd Qu.:42048                                                           
##  Max.   :56039                                                           
##   last_update                       lat            long_        
##  Min.   :2020-04-04 23:34:21   Min.   :19.60   Min.   :-159.60  
##  1st Qu.:2020-04-04 23:34:21   1st Qu.:33.96   1st Qu.: -94.52  
##  Median :2020-04-04 23:34:21   Median :38.02   Median : -86.48  
##  Mean   :2020-04-04 23:34:21   Mean   :37.71   Mean   : -89.73  
##  3rd Qu.:2020-04-04 23:34:21   3rd Qu.:41.39   3rd Qu.: -81.21  
##  Max.   :2020-04-04 23:34:21   Max.   :64.81   Max.   : -68.65  
##    confirmed           deaths           recovered     active 
##  Min.   :    5.0   Min.   :   0.000   Min.   :0   Min.   :0  
##  1st Qu.:    9.0   1st Qu.:   0.000   1st Qu.:0   1st Qu.:0  
##  Median :   20.0   Median :   0.000   Median :0   Median :0  
##  Mean   :  207.2   Mean   :   4.854   Mean   :0   Mean   :0  
##  3rd Qu.:   66.0   3rd Qu.:   2.000   3rd Qu.:0   3rd Qu.:0  
##  Max.   :63306.0   Max.   :1905.000   Max.   :0   Max.   :0  
##  combined_key          state              county         
##  Length:1446        Length:1446        Length:1446       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##  percent_fair_or_poor_health percent_smokers  percent_adults_with_obesity
##  Min.   : 8.121              Min.   : 5.909   Min.   :12.40              
##  1st Qu.:14.390              1st Qu.:14.899   1st Qu.:29.10              
##  Median :17.010              Median :17.143   Median :32.90              
##  Mean   :17.594              Mean   :17.151   Mean   :32.39              
##  3rd Qu.:20.398              3rd Qu.:19.365   3rd Qu.:36.20              
##  Max.   :38.887              Max.   :27.775   Max.   :51.00              
##  percent_with_access_to_exercise_opportunities percent_excessive_drinking
##  Min.   :  0.00                                Min.   : 7.81             
##  1st Qu.: 59.95                                1st Qu.:15.68             
##  Median : 74.71                                Median :18.03             
##  Mean   : 71.15                                Mean   :17.91             
##  3rd Qu.: 85.97                                3rd Qu.:20.01             
##  Max.   :100.00                                Max.   :28.62             
##  percent_uninsured percent_some_college percent_unemployed
##  Min.   : 2.263    Min.   :21.14        Min.   : 1.582    
##  1st Qu.: 6.754    1st Qu.:53.21        1st Qu.: 3.252    
##  Median : 9.937    Median :61.19        Median : 3.870    
##  Mean   :10.592    Mean   :60.83        Mean   : 4.071    
##  3rd Qu.:13.527    3rd Qu.:68.72        3rd Qu.: 4.690    
##  Max.   :31.208    Max.   :90.34        Max.   :18.092    
##  percent_children_in_poverty percent_single_parent_households
##  Min.   : 2.50               Min.   : 9.43                   
##  1st Qu.:12.82               1st Qu.:27.07                   
##  Median :18.40               Median :32.96                   
##  Mean   :19.46               Mean   :33.83                   
##  3rd Qu.:24.50               3rd Qu.:38.93                   
##  Max.   :55.00               Max.   :80.00                   
##  percent_severe_housing_problems  overcrowding    percent_adults_with_diabetes
##  Min.   : 6.562                  Min.   : 0.000   Min.   : 1.80               
##  1st Qu.:12.267                  1st Qu.: 1.379   1st Qu.: 9.10               
##  Median :14.439                  Median : 1.971   Median :11.30               
##  Mean   :15.082                  Mean   : 2.437   Mean   :11.75               
##  3rd Qu.:16.992                  3rd Qu.: 2.887   3rd Qu.:13.90               
##  Max.   :33.391                  Max.   :14.489   Max.   :34.10               
##  percent_food_insecure percent_insufficient_sleep percent_uninsured_2
##  Min.   : 3.40         Min.   :23.03              Min.   : 2.683     
##  1st Qu.:10.70         1st Qu.:31.42              1st Qu.: 7.865     
##  Median :12.70         Median :34.02              Median :12.027     
##  Mean   :13.25         Mean   :33.88              Mean   :12.786     
##  3rd Qu.:15.20         3rd Qu.:36.54              3rd Qu.:16.572     
##  Max.   :33.50         Max.   :46.71              Max.   :42.397     
##  median_household_income average_traffic_volume_per_meter_of_major_roadways
##  Min.   : 25385          Min.   :   0.00                                   
##  1st Qu.: 46994          1st Qu.:  53.09                                   
##  Median : 54317          Median : 104.63                                   
##  Mean   : 57600          Mean   : 200.72                                   
##  3rd Qu.: 64775          3rd Qu.: 206.78                                   
##  Max.   :140382          Max.   :4444.12                                   
##  percent_homeowners  population_2      percent_less_than_18_years_of_age
##  Min.   :24.13      Min.   :    2887   Min.   : 7.069                   
##  1st Qu.:64.36      1st Qu.:   36275   1st Qu.:20.326                   
##  Median :69.96      Median :   75382   Median :22.182                   
##  Mean   :68.99      Mean   :  201689   Mean   :22.204                   
##  3rd Qu.:74.77      3rd Qu.:  179982   3rd Qu.:24.019                   
##  Max.   :89.76      Max.   :10105518   Max.   :35.447                   
##  percent_65_and_over percent_black     percent_asian      percent_hispanic 
##  Min.   : 7.722      Min.   : 0.1286   Min.   : 0.06245   Min.   : 0.7952  
##  1st Qu.:14.913      1st Qu.: 1.6168   1st Qu.: 0.68228   1st Qu.: 2.9451  
##  Median :17.225      Median : 5.6397   Median : 1.22863   Median : 5.6100  
##  Mean   :17.512      Mean   :12.4056   Mean   : 2.40009   Mean   :10.0338  
##  3rd Qu.:19.598      3rd Qu.:17.4904   3rd Qu.: 2.66813   3rd Qu.:11.1199  
##  Max.   :57.587      Max.   :81.9544   Max.   :42.95231   Max.   :96.3595  
##  percent_female  percent_rural   
##  Min.   :34.63   Min.   :  0.00  
##  1st Qu.:50.00   1st Qu.: 17.11  
##  Median :50.65   Median : 36.97  
##  Mean   :50.46   Mean   : 40.12  
##  3rd Qu.:51.35   3rd Qu.: 60.04  
##  Max.   :56.87   Max.   :100.00

If there are variables with missing value for many counties, we go back and remove those variables from consideration.

Let’s create a final data frame for analysis.

county_data <- county_data %>%
  mutate(state = as.factor(state)) %>%
  dplyr::select(county, confirmed, deaths, state, percent_fair_or_poor_health:percent_rural)
summary(county_data)
##     county            confirmed           deaths                    state     
##  Length:1446        Min.   :    5.0   Min.   :   0.000   Georgia       :  96  
##  Class :character   1st Qu.:    9.0   1st Qu.:   0.000   Texas         :  80  
##  Mode  :character   Median :   20.0   Median :   0.000   North Carolina:  63  
##                     Mean   :  207.2   Mean   :   4.854   Mississippi   :  61  
##                     3rd Qu.:   66.0   3rd Qu.:   2.000   Indiana       :  58  
##                     Max.   :63306.0   Max.   :1905.000   Ohio          :  57  
##                                                          (Other)       :1031  
##  percent_fair_or_poor_health percent_smokers  percent_adults_with_obesity
##  Min.   : 8.121              Min.   : 5.909   Min.   :12.40              
##  1st Qu.:14.390              1st Qu.:14.899   1st Qu.:29.10              
##  Median :17.010              Median :17.143   Median :32.90              
##  Mean   :17.594              Mean   :17.151   Mean   :32.39              
##  3rd Qu.:20.398              3rd Qu.:19.365   3rd Qu.:36.20              
##  Max.   :38.887              Max.   :27.775   Max.   :51.00              
##                                                                          
##  percent_with_access_to_exercise_opportunities percent_excessive_drinking
##  Min.   :  0.00                                Min.   : 7.81             
##  1st Qu.: 59.95                                1st Qu.:15.68             
##  Median : 74.71                                Median :18.03             
##  Mean   : 71.15                                Mean   :17.91             
##  3rd Qu.: 85.97                                3rd Qu.:20.01             
##  Max.   :100.00                                Max.   :28.62             
##                                                                          
##  percent_uninsured percent_some_college percent_unemployed
##  Min.   : 2.263    Min.   :21.14        Min.   : 1.582    
##  1st Qu.: 6.754    1st Qu.:53.21        1st Qu.: 3.252    
##  Median : 9.937    Median :61.19        Median : 3.870    
##  Mean   :10.592    Mean   :60.83        Mean   : 4.071    
##  3rd Qu.:13.527    3rd Qu.:68.72        3rd Qu.: 4.690    
##  Max.   :31.208    Max.   :90.34        Max.   :18.092    
##                                                           
##  percent_children_in_poverty percent_single_parent_households
##  Min.   : 2.50               Min.   : 9.43                   
##  1st Qu.:12.82               1st Qu.:27.07                   
##  Median :18.40               Median :32.96                   
##  Mean   :19.46               Mean   :33.83                   
##  3rd Qu.:24.50               3rd Qu.:38.93                   
##  Max.   :55.00               Max.   :80.00                   
##                                                              
##  percent_severe_housing_problems  overcrowding    percent_adults_with_diabetes
##  Min.   : 6.562                  Min.   : 0.000   Min.   : 1.80               
##  1st Qu.:12.267                  1st Qu.: 1.379   1st Qu.: 9.10               
##  Median :14.439                  Median : 1.971   Median :11.30               
##  Mean   :15.082                  Mean   : 2.437   Mean   :11.75               
##  3rd Qu.:16.992                  3rd Qu.: 2.887   3rd Qu.:13.90               
##  Max.   :33.391                  Max.   :14.489   Max.   :34.10               
##                                                                               
##  percent_food_insecure percent_insufficient_sleep percent_uninsured_2
##  Min.   : 3.40         Min.   :23.03              Min.   : 2.683     
##  1st Qu.:10.70         1st Qu.:31.42              1st Qu.: 7.865     
##  Median :12.70         Median :34.02              Median :12.027     
##  Mean   :13.25         Mean   :33.88              Mean   :12.786     
##  3rd Qu.:15.20         3rd Qu.:36.54              3rd Qu.:16.572     
##  Max.   :33.50         Max.   :46.71              Max.   :42.397     
##                                                                      
##  median_household_income average_traffic_volume_per_meter_of_major_roadways
##  Min.   : 25385          Min.   :   0.00                                   
##  1st Qu.: 46994          1st Qu.:  53.09                                   
##  Median : 54317          Median : 104.63                                   
##  Mean   : 57600          Mean   : 200.72                                   
##  3rd Qu.: 64775          3rd Qu.: 206.78                                   
##  Max.   :140382          Max.   :4444.12                                   
##                                                                            
##  percent_homeowners  population_2      percent_less_than_18_years_of_age
##  Min.   :24.13      Min.   :    2887   Min.   : 7.069                   
##  1st Qu.:64.36      1st Qu.:   36275   1st Qu.:20.326                   
##  Median :69.96      Median :   75382   Median :22.182                   
##  Mean   :68.99      Mean   :  201689   Mean   :22.204                   
##  3rd Qu.:74.77      3rd Qu.:  179982   3rd Qu.:24.019                   
##  Max.   :89.76      Max.   :10105518   Max.   :35.447                   
##                                                                         
##  percent_65_and_over percent_black     percent_asian      percent_hispanic 
##  Min.   : 7.722      Min.   : 0.1286   Min.   : 0.06245   Min.   : 0.7952  
##  1st Qu.:14.913      1st Qu.: 1.6168   1st Qu.: 0.68228   1st Qu.: 2.9451  
##  Median :17.225      Median : 5.6397   Median : 1.22863   Median : 5.6100  
##  Mean   :17.512      Mean   :12.4056   Mean   : 2.40009   Mean   :10.0338  
##  3rd Qu.:19.598      3rd Qu.:17.4904   3rd Qu.: 2.66813   3rd Qu.:11.1199  
##  Max.   :57.587      Max.   :81.9544   Max.   :42.95231   Max.   :96.3595  
##                                                                            
##  percent_female  percent_rural   
##  Min.   :34.63   Min.   :  0.00  
##  1st Qu.:50.00   1st Qu.: 17.11  
##  Median :50.65   Median : 36.97  
##  Mean   :50.46   Mean   : 40.12  
##  3rd Qu.:51.35   3rd Qu.: 60.04  
##  Max.   :56.87   Max.   :100.00  
## 

Display the 10 counties with highest CFR.

county_data %>%
  mutate(cfr = deaths / confirmed) %>%
  dplyr::select(county, state, confirmed, deaths, cfr) %>%
  arrange(desc(cfr)) %>%
  top_n(10)
## # A tibble: 18 × 5
##    county         state          confirmed deaths   cfr
##    <chr>          <fct>              <dbl>  <dbl> <dbl>
##  1 Emmet          Michigan               7      2 0.286
##  2 Grand Traverse Michigan              12      3 0.25 
##  3 Toole          Montana               12      3 0.25 
##  4 Fayette        Indiana               14      3 0.214
##  5 Concordia      Louisiana              5      1 0.2  
##  6 Harrison       Texas                  5      1 0.2  
##  7 Huntington     Indiana                5      1 0.2  
##  8 Isabella       Michigan              10      2 0.2  
##  9 McDuffie       Georgia                5      1 0.2  
## 10 Navarro        Texas                  5      1 0.2  
## 11 Orange         Indiana                5      1 0.2  
## 12 Perry          Pennsylvania           5      1 0.2  
## 13 Randolph       Indiana                5      1 0.2  
## 14 Rockingham     North Carolina         5      1 0.2  
## 15 Seneca         Ohio                   5      1 0.2  
## 16 Toombs         Georgia                5      1 0.2  
## 17 Vigo           Indiana               10      2 0.2  
## 18 Washington     Alabama                5      1 0.2

Write final data into a csv file for future use.

write_csv(county_data, "./datasets/covid19-county-data-20200404.csv.gz")

Note:

Given that the datasets were collected in the middle of the pandemic, what assumptions of CFR might be violated by defining CFR as deaths/confirmed from this data set?

Because COVID-19 pandemic was still ongoing in 2020, we should realize some critical assumptions for defining CFR are not met using this datasets.

  1. Numbers of confirmed cases do not reflect the number of diagnosed people. This is mainly limited by the availability of testing.

  2. Some confirmed cases may die later.

With acknowledgement of these severe limitations, we continue to use deaths/confirmed as a very rough proxy of CFR.

Q1.1

Read and run above code to generate a data frame county_data that includes county-level COVID-19 confirmed cases and deaths, demographic, and health related information.

Answer: Data frame county_data has been created.

Q1.2

What assumptions of logistic regression may be violated by this data set?

Answer: Each observation(row) is not independent to the other observation. If two counties are close to each other, people who are affected in one county may also affect people in the other county. Therefore, I would expect nearby counties have similar confirmed cases and deaths. Multicollinearity is another concern.

Q1.3

Run a logistic regression, using variables state, …, percent_rural as predictors.

Answer:

county_data <- county_data |>
  print(width = Inf)
## # A tibble: 1,446 × 31
##    county    confirmed deaths state          percent_fair_or_poor_health
##    <chr>         <dbl>  <dbl> <fct>                                <dbl>
##  1 Abbeville         6      0 South Carolina                        19.9
##  2 Acadia           65      2 Louisiana                             20.9
##  3 Accomack          8      0 Virginia                              20.1
##  4 Ada             360      3 Idaho                                 11.5
##  5 Adair            10      0 Missouri                              21.4
##  6 Adair            14      0 Oklahoma                              28.5
##  7 Adams           294      9 Colorado                              16.6
##  8 Adams            16      0 Mississippi                           27.3
##  9 Adams             8      0 Nebraska                              15.8
## 10 Adams            21      0 Pennsylvania                          15.3
##    percent_smokers percent_adults_with_obesity
##              <dbl>                       <dbl>
##  1            17.3                        36.7
##  2            21.5                        38.4
##  3            18.3                        36.3
##  4            12.0                        25.6
##  5            20.5                        27.9
##  6            27.7                        47.7
##  7            16.3                        27.8
##  8            22.2                        35.3
##  9            14.6                        36.7
## 10            16.2                        35.6
##    percent_with_access_to_exercise_opportunities percent_excessive_drinking
##                                            <dbl>                      <dbl>
##  1                                          59.0                       15.9
##  2                                          42.5                       19.8
##  3                                          37.4                       15.5
##  4                                          89.5                       17.9
##  5                                          78.3                       18.9
##  6                                          28.5                       11.8
##  7                                          93.1                       18.9
##  8                                          69.1                       12.3
##  9                                          81.6                       18.5
## 10                                          60.6                       19.2
##    percent_uninsured percent_some_college percent_unemployed
##                <dbl>                <dbl>              <dbl>
##  1             12.9                  52.5               3.98
##  2             10.7                  43.6               5.37
##  3             16.6                  45.1               3.81
##  4              8.74                 73.8               2.46
##  5             10.6                  65.3               3.51
##  6             24.5                  35.1               4.17
##  7             11.0                  57.0               3.47
##  8             15.0                  41.7               6.21
##  9              8.76                 70.8               2.87
## 10              7.49                 57.3               3.27
##    percent_children_in_poverty percent_single_parent_households
##                          <dbl>                            <dbl>
##  1                        30.8                             37.1
##  2                        35.4                             33.4
##  3                        27                               45.9
##  4                        10.2                             23.8
##  5                        19.9                             29.5
##  6                        34.9                             38.3
##  7                        12.6                             31.0
##  8                        40.4                             66.4
##  9                        14.4                             26.2
## 10                        11.2                             26.7
##    percent_severe_housing_problems overcrowding percent_adults_with_diabetes
##                              <dbl>        <dbl>                        <dbl>
##  1                            14.3        0.463                         15.8
##  2                            12.3        3.51                          11.4
##  3                            15.1        2.10                          15.9
##  4                            14.0        1.46                           7.9
##  5                            18.0        0.740                          8.4
##  6                            15.4        5.65                          24.3
##  7                            18.1        5.37                           7.7
##  8                            12.8        2.37                          13.2
##  9                            10.5        0.904                         11  
## 10                            12.3        1.88                           8.5
##    percent_food_insecure percent_insufficient_sleep percent_uninsured_2
##                    <dbl>                      <dbl>               <dbl>
##  1                  15.2                       36.1               15.9 
##  2                  15.1                       32.4               14.0 
##  3                  14.1                       36.8               19.4 
##  4                  12                         26.3               11.1 
##  5                  17.5                       31.9               12.3 
##  6                  19.1                       39.5               29.6 
##  7                   8                         31.0               13.8 
##  8                  24.7                       41.1               18.7 
##  9                  11.7                       30.1               10.7 
## 10                   8.3                       34.7                8.46
##    median_household_income average_traffic_volume_per_meter_of_major_roadways
##                      <dbl>                                              <dbl>
##  1                   42412                                               11.6
##  2                   40484                                               63.7
##  3                   42879                                               60.0
##  4                   66827                                              277. 
##  5                   40395                                               45.8
##  6                   35156                                               16.7
##  7                   70199                                              490. 
##  8                   33392                                              150. 
##  9                   55167                                               53.4
## 10                   62877                                              113. 
##    percent_homeowners population_2 percent_less_than_18_years_of_age
##                 <dbl>        <dbl>                             <dbl>
##  1               76.3        24541                              20.1
##  2               70.8        62190                              25.8
##  3               67.9        32412                              20.5
##  4               68.4       469966                              23.8
##  5               60.0        25339                              18.4
##  6               68.6        22082                              26.6
##  7               65.2       511868                              26.5
##  8               61.7        31192                              20.1
##  9               68.2        31511                              23.7
## 10               77.2       102811                              20.0
##    percent_65_and_over percent_black percent_asian percent_hispanic
##                  <dbl>         <dbl>         <dbl>            <dbl>
##  1                21.8        27.5           0.412             1.54
##  2                15.3        17.9           0.320             2.73
##  3                23.6        28.0           0.781             9.34
##  4                14.4         1.24          2.81              8.31
##  5                14.8         2.85          2.28              2.57
##  6                15.9         0.534         0.802             6.82
##  7                10.5         3.19          4.37             40.4 
##  8                18.8        52.4           0.513            11.3 
##  9                18.2         0.996         1.33             10.9 
## 10                20.4         1.60          0.875             7.11
##    percent_female percent_rural
##             <dbl>         <dbl>
##  1           51.6         78.6 
##  2           51.2         51.7 
##  3           51.2        100   
##  4           49.9          5.47
##  5           51.9         37.9 
##  6           50.1         83.3 
##  7           49.5          3.62
##  8           47.9         37.2 
##  9           50.2         22.5 
## 10           50.8         53.7 
## # ℹ 1,436 more rows
mod <- glm(cbind(deaths, confirmed - deaths) ~ . -county, family = "binomial", data = county_data)
summary(mod)
## 
## Call:
## glm(formula = cbind(deaths, confirmed - deaths) ~ . - county, 
##     family = "binomial", data = county_data)
## 
## Coefficients:
##                                                      Estimate Std. Error
## (Intercept)                                         1.682e+00  1.625e+00
## stateAlaska                                        -8.493e-01  1.041e+00
## stateArizona                                        6.061e-01  3.670e-01
## stateArkansas                                      -1.177e+00  4.946e-01
## stateCalifornia                                     2.053e-01  3.240e-01
## stateColorado                                       4.335e-01  2.642e-01
## stateConnecticut                                    4.387e-02  2.508e-01
## stateDelaware                                      -6.062e-01  3.565e-01
## stateDistrict of Columbia                          -1.359e+00  5.202e-01
## stateFlorida                                        2.603e-01  2.396e-01
## stateGeorgia                                        2.200e-01  1.965e-01
## stateHawaii                                        -1.672e+00  7.030e-01
## stateIdaho                                         -6.081e-01  4.250e-01
## stateIllinois                                       1.492e-01  2.240e-01
## stateIndiana                                        4.063e-01  2.205e-01
## stateIowa                                          -4.926e-01  3.933e-01
## stateKansas                                         3.751e-01  3.367e-01
## stateKentucky                                       3.673e-01  2.681e-01
## stateLouisiana                                      1.264e-01  2.126e-01
## stateMaine                                         -5.892e-01  4.451e-01
## stateMaryland                                      -7.426e-01  2.627e-01
## stateMassachusetts                                 -9.027e-01  2.649e-01
## stateMichigan                                       2.331e-01  2.081e-01
## stateMinnesota                                     -3.288e-01  3.327e-01
## stateMississippi                                    2.270e-01  2.442e-01
## stateMissouri                                      -3.473e-01  2.588e-01
## stateMontana                                       -5.572e-02  4.675e-01
## stateNebraska                                      -2.718e-01  5.581e-01
## stateNevada                                         2.790e-01  2.829e-01
## stateNew Hampshire                                 -1.910e+00  6.172e-01
## stateNew Jersey                                     1.841e-02  2.213e-01
## stateNew Mexico                                     7.225e-01  5.644e-01
## stateNew York                                      -9.251e-01  2.466e-01
## stateNorth Carolina                                -6.333e-01  2.474e-01
## stateNorth Dakota                                  -9.169e-01  7.659e-01
## stateOhio                                          -6.817e-02  2.334e-01
## stateOklahoma                                       8.480e-01  2.916e-01
## stateOregon                                         5.793e-02  3.179e-01
## statePennsylvania                                  -8.069e-01  2.300e-01
## stateRhode Island                                  -2.158e+00  6.259e-01
## stateSouth Carolina                                -5.238e-01  2.477e-01
## stateSouth Dakota                                  -7.228e-01  7.496e-01
## stateTennessee                                     -5.532e-01  2.251e-01
## stateTexas                                          8.430e-01  3.826e-01
## stateUtah                                          -9.928e-01  4.792e-01
## stateVermont                                       -1.210e+00  5.099e-01
## stateVirginia                                      -1.533e+00  3.178e-01
## stateWashington                                     5.438e-01  2.439e-01
## stateWest Virginia                                 -1.527e+00  7.345e-01
## stateWisconsin                                     -2.016e-01  2.809e-01
## stateWyoming                                       -1.374e+01  3.452e+02
## percent_fair_or_poor_health                        -5.431e-02  1.836e-02
## percent_smokers                                    -4.310e-02  1.756e-02
## percent_adults_with_obesity                        -5.682e-03  8.571e-03
## percent_with_access_to_exercise_opportunities      -2.890e-04  2.800e-03
## percent_excessive_drinking                         -1.536e-02  1.315e-02
## percent_uninsured                                  -1.848e-01  8.745e-02
## percent_some_college                               -1.634e-02  5.382e-03
## percent_unemployed                                  5.588e-02  4.630e-02
## percent_children_in_poverty                         3.027e-02  7.658e-03
## percent_single_parent_households                   -1.520e-02  6.238e-03
## percent_severe_housing_problems                    -4.474e-02  1.076e-02
## overcrowding                                        1.984e-02  2.573e-02
## percent_adults_with_diabetes                       -2.174e-02  1.385e-02
## percent_food_insecure                              -9.537e-02  4.773e-02
## percent_insufficient_sleep                          4.438e-02  9.945e-03
## percent_uninsured_2                                 1.203e-01  7.164e-02
## median_household_income                            -1.084e-05  3.321e-06
## average_traffic_volume_per_meter_of_major_roadways  1.530e-05  4.559e-05
## percent_homeowners                                 -3.398e-02  6.146e-03
## population_2                                       -3.519e-08  1.676e-08
## percent_less_than_18_years_of_age                  -4.812e-02  1.624e-02
## percent_65_and_over                                 9.562e-03  1.177e-02
## percent_black                                       8.741e-03  5.856e-03
## percent_asian                                      -5.847e-04  5.986e-03
## percent_hispanic                                   -1.920e-02  8.405e-03
## percent_female                                      4.911e-02  2.768e-02
## percent_rural                                       1.074e-03  2.048e-03
##                                                    z value Pr(>|z|)    
## (Intercept)                                          1.035 0.300740    
## stateAlaska                                         -0.815 0.414808    
## stateArizona                                         1.652 0.098635 .  
## stateArkansas                                       -2.380 0.017303 *  
## stateCalifornia                                      0.634 0.526177    
## stateColorado                                        1.641 0.100812    
## stateConnecticut                                     0.175 0.861175    
## stateDelaware                                       -1.700 0.089074 .  
## stateDistrict of Columbia                           -2.613 0.008975 ** 
## stateFlorida                                         1.086 0.277405    
## stateGeorgia                                         1.120 0.262714    
## stateHawaii                                         -2.379 0.017369 *  
## stateIdaho                                          -1.431 0.152481    
## stateIllinois                                        0.666 0.505345    
## stateIndiana                                         1.843 0.065355 .  
## stateIowa                                           -1.252 0.210425    
## stateKansas                                          1.114 0.265215    
## stateKentucky                                        1.370 0.170696    
## stateLouisiana                                       0.595 0.552042    
## stateMaine                                          -1.324 0.185570    
## stateMaryland                                       -2.827 0.004696 ** 
## stateMassachusetts                                  -3.408 0.000654 ***
## stateMichigan                                        1.120 0.262561    
## stateMinnesota                                      -0.988 0.322955    
## stateMississippi                                     0.930 0.352585    
## stateMissouri                                       -1.342 0.179549    
## stateMontana                                        -0.119 0.905125    
## stateNebraska                                       -0.487 0.626265    
## stateNevada                                          0.986 0.324043    
## stateNew Hampshire                                  -3.095 0.001967 ** 
## stateNew Jersey                                      0.083 0.933691    
## stateNew Mexico                                      1.280 0.200470    
## stateNew York                                       -3.751 0.000176 ***
## stateNorth Carolina                                 -2.560 0.010465 *  
## stateNorth Dakota                                   -1.197 0.231273    
## stateOhio                                           -0.292 0.770288    
## stateOklahoma                                        2.908 0.003640 ** 
## stateOregon                                          0.182 0.855417    
## statePennsylvania                                   -3.509 0.000450 ***
## stateRhode Island                                   -3.448 0.000564 ***
## stateSouth Carolina                                 -2.115 0.034460 *  
## stateSouth Dakota                                   -0.964 0.334969    
## stateTennessee                                      -2.457 0.014004 *  
## stateTexas                                           2.203 0.027572 *  
## stateUtah                                           -2.072 0.038264 *  
## stateVermont                                        -2.373 0.017644 *  
## stateVirginia                                       -4.823 1.42e-06 ***
## stateWashington                                      2.229 0.025799 *  
## stateWest Virginia                                  -2.078 0.037668 *  
## stateWisconsin                                      -0.718 0.473007    
## stateWyoming                                        -0.040 0.968252    
## percent_fair_or_poor_health                         -2.958 0.003092 ** 
## percent_smokers                                     -2.454 0.014114 *  
## percent_adults_with_obesity                         -0.663 0.507376    
## percent_with_access_to_exercise_opportunities       -0.103 0.917808    
## percent_excessive_drinking                          -1.169 0.242584    
## percent_uninsured                                   -2.113 0.034560 *  
## percent_some_college                                -3.036 0.002399 ** 
## percent_unemployed                                   1.207 0.227472    
## percent_children_in_poverty                          3.952 7.75e-05 ***
## percent_single_parent_households                    -2.437 0.014816 *  
## percent_severe_housing_problems                     -4.158 3.21e-05 ***
## overcrowding                                         0.771 0.440593    
## percent_adults_with_diabetes                        -1.570 0.116373    
## percent_food_insecure                               -1.998 0.045683 *  
## percent_insufficient_sleep                           4.462 8.12e-06 ***
## percent_uninsured_2                                  1.680 0.093025 .  
## median_household_income                             -3.266 0.001092 ** 
## average_traffic_volume_per_meter_of_major_roadways   0.336 0.737150    
## percent_homeowners                                  -5.529 3.22e-08 ***
## population_2                                        -2.100 0.035741 *  
## percent_less_than_18_years_of_age                   -2.964 0.003038 ** 
## percent_65_and_over                                  0.813 0.416486    
## percent_black                                        1.493 0.135505    
## percent_asian                                       -0.098 0.922197    
## percent_hispanic                                    -2.285 0.022327 *  
## percent_female                                       1.774 0.076027 .  
## percent_rural                                        0.524 0.599962    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3257.7  on 1445  degrees of freedom
## Residual deviance: 1842.8  on 1368  degrees of freedom
## AIC: 3880.6
## 
## Number of Fisher Scoring iterations: 14

Q1.4

Interpret the regression coefficients of 3 significant predictors with p-value <0.01.

Answer:

percent_less_than_18_years_of_age - For each 1 percent increase in the proportion of people less than 18 years of age, the odds of death decreases by 4.47%, while adjusting for other variables.

percent_insufficient_sleep - For each 1 percent increase in the proportion of people with insufficient sleep, the odds of death decreases by 4.3%, while adjusting for other variables.

percent_homeowners - For each 1 percent increase in the proportion of homeowners, the odds of death decreases by 3.3%, while adjusting for other variables.

Q1.5

Apply analysis of deviance to (1) evaluate the goodness of fit of the model and (2) compare the model to the intercept-only model.

Answer:

#(1)
pchisq(mod$deviance, mod$df.residual, lower = FALSE)
## [1] 1.099787e-16
#(2)
pchisq(mod$null.deviance - mod$deviance, mod$df.null - mod$df.residual, lower.tail = FALSE)
## [1] 5.173983e-245
  1. We can see the model is not a good fit as the p-value is less than .05. (2) The model gives a significantly better fit than the null (intercept-only) model as the p-value is less than .05.

Q1.6

Perform analysis of deviance to evaluate the significance of each predictor. Display the 10 most significant predictors.

Answer: 10 most significant predictors are displayed below.

# Running drop1 with Chi-squared test
result <- drop1(mod, test = "Chi")

# Sorting the results by the Chi-squared statistic in descending order and selecting the top 10
top_predictors <- head(result[order(result$`Pr(>Chi)`), ], 10)

# Display the top 10 significant predictors
print(top_predictors)
## Single term deletions
## 
## Model:
## cbind(deaths, confirmed - deaths) ~ (county + state + percent_fair_or_poor_health + 
##     percent_smokers + percent_adults_with_obesity + percent_with_access_to_exercise_opportunities + 
##     percent_excessive_drinking + percent_uninsured + percent_some_college + 
##     percent_unemployed + percent_children_in_poverty + percent_single_parent_households + 
##     percent_severe_housing_problems + overcrowding + percent_adults_with_diabetes + 
##     percent_food_insecure + percent_insufficient_sleep + percent_uninsured_2 + 
##     median_household_income + average_traffic_volume_per_meter_of_major_roadways + 
##     percent_homeowners + population_2 + percent_less_than_18_years_of_age + 
##     percent_65_and_over + percent_black + percent_asian + percent_hispanic + 
##     percent_female + percent_rural) - county
##                                   Df Deviance    AIC    LRT  Pr(>Chi)    
## state                             50   2413.8 4351.7 571.08 < 2.2e-16 ***
## percent_homeowners                 1   1873.7 3909.6  30.98 2.601e-08 ***
## percent_insufficient_sleep         1   1862.8 3898.6  19.99 7.769e-06 ***
## percent_severe_housing_problems    1   1859.9 3895.8  17.16 3.444e-05 ***
## percent_children_in_poverty        1   1858.4 3894.3  15.67 7.553e-05 ***
## median_household_income            1   1853.5 3889.3  10.70  0.001071 ** 
## percent_some_college               1   1852.0 3887.9   9.25  0.002349 ** 
## percent_fair_or_poor_health        1   1851.5 3887.4   8.80  0.003016 ** 
## percent_less_than_18_years_of_age  1   1851.5 3887.4   8.75  0.003094 ** 
## percent_smokers                    1   1848.8 3884.7   6.02  0.014122 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Q1.7

Construct confidence intervals of regression coefficients.

Answer:

confint(mod)
##                                                            2.5 %        97.5 %
## (Intercept)                                        -1.540685e+00  4.829928e+00
## stateAlaska                                        -3.753299e+00  7.665854e-01
## stateArizona                                       -1.121919e-01  1.326975e+00
## stateArkansas                                      -2.272222e+00 -2.941005e-01
## stateCalifornia                                    -4.270811e-01  8.432042e-01
## stateColorado                                      -8.047425e-02  9.558273e-01
## stateConnecticut                                   -4.431493e-01  5.408817e-01
## stateDelaware                                      -1.343063e+00  6.491602e-02
## stateDistrict of Columbia                          -2.385522e+00 -3.452739e-01
## stateFlorida                                       -2.045630e-01  7.356315e-01
## stateGeorgia                                       -1.575974e-01  6.138168e-01
## stateHawaii                                        -3.239788e+00 -4.080654e-01
## stateIdaho                                         -1.491142e+00  1.890745e-01
## stateIllinois                                      -2.839137e-01  5.952862e-01
## stateIndiana                                       -2.054079e-02  8.447684e-01
## stateIowa                                          -1.301477e+00  2.507241e-01
## stateKansas                                        -3.042060e-01  1.020971e+00
## stateKentucky                                      -1.643561e-01  8.894116e-01
## stateLouisiana                                     -2.837727e-01  5.505662e-01
## stateMaine                                         -1.538751e+00  2.289083e-01
## stateMaryland                                      -1.256754e+00 -2.257937e-01
## stateMassachusetts                                 -1.418191e+00 -3.792222e-01
## stateMichigan                                      -1.674387e-01  6.492574e-01
## stateMinnesota                                     -9.984654e-01  3.106177e-01
## stateMississippi                                   -2.569089e-01  7.032953e-01
## stateMissouri                                      -8.618337e-01  1.558494e-01
## stateMontana                                       -1.068580e+00  7.935561e-01
## stateNebraska                                      -1.524163e+00  7.176783e-01
## stateNevada                                        -2.755078e-01  8.345940e-01
## stateNew Hampshire                                 -3.354740e+00 -8.509625e-01
## stateNew Jersey                                    -4.091492e-01  4.592997e-01
## stateNew Mexico                                    -4.112221e-01  1.809014e+00
## stateNew York                                      -1.403926e+00 -4.363713e-01
## stateNorth Carolina                                -1.123934e+00 -1.511085e-01
## stateNorth Dakota                                  -2.782881e+00  3.734572e-01
## stateOhio                                          -5.211860e-01  3.949778e-01
## stateOklahoma                                       2.753046e-01  1.419962e+00
## stateOregon                                        -5.767115e-01  6.730255e-01
## statePennsylvania                                  -1.252486e+00 -3.499987e-01
## stateRhode Island                                  -3.614905e+00 -1.077287e+00
## stateSouth Carolina                                -1.012535e+00 -3.916322e-02
## stateSouth Dakota                                  -2.568364e+00  5.247816e-01
## stateTennessee                                     -9.945980e-01 -1.098143e-01
## stateTexas                                          9.571634e-02  1.595777e+00
## stateUtah                                          -1.982670e+00 -8.921095e-02
## stateVermont                                       -2.328611e+00 -2.914186e-01
## stateVirginia                                      -2.181136e+00 -9.283012e-01
## stateWashington                                     7.105650e-02  1.028006e+00
## stateWest Virginia                                 -3.353514e+00 -3.202647e-01
## stateWisconsin                                     -7.521121e-01  3.500790e-01
## stateWyoming                                       -4.065085e+02 -4.687629e+02
## percent_fair_or_poor_health                        -9.034904e-02 -1.839060e-02
## percent_smokers                                    -7.751397e-02 -8.679085e-03
## percent_adults_with_obesity                        -2.246740e-02  1.113176e-02
## percent_with_access_to_exercise_opportunities      -5.745291e-03  5.231292e-03
## percent_excessive_drinking                         -4.111152e-02  1.042741e-02
## percent_uninsured                                  -3.573039e-01 -1.480706e-02
## percent_some_college                               -2.689944e-02 -5.803777e-03
## percent_unemployed                                 -3.626838e-02  1.451401e-01
## percent_children_in_poverty                         1.526815e-02  4.528901e-02
## percent_single_parent_households                   -2.741900e-02 -2.963883e-03
## percent_severe_housing_problems                    -6.578986e-02 -2.361278e-02
## overcrowding                                       -3.071034e-02  7.016017e-02
## percent_adults_with_diabetes                       -4.908876e-02  5.180565e-03
## percent_food_insecure                              -1.891933e-01 -2.105405e-03
## percent_insufficient_sleep                          2.490233e-02  6.388899e-02
## percent_uninsured_2                                -1.936819e-02  2.612330e-01
## median_household_income                            -1.735930e-05 -4.342325e-06
## average_traffic_volume_per_meter_of_major_roadways -7.390708e-05  1.048167e-04
## percent_homeowners                                 -4.605639e-02 -2.196463e-02
## population_2                                       -6.804682e-08 -2.338900e-09
## percent_less_than_18_years_of_age                  -7.990811e-02 -1.626028e-02
## percent_65_and_over                                -1.360161e-02  3.253736e-02
## percent_black                                      -2.712684e-03  2.024150e-02
## percent_asian                                      -1.233780e-02  1.112980e-02
## percent_hispanic                                   -3.569677e-02 -2.750001e-03
## percent_female                                     -4.049216e-03  1.044219e-01
## percent_rural                                      -2.958610e-03  5.070711e-03

Q1.8

Plot the deviance residuals against the fitted values. Are there potential outliers?

Answer:

county_data %>%
  mutate(devres  = residuals(mod, type = "deviance"),
         linpred = predict(mod, type = "link")) %>%
  ggplot + 
  geom_point(mapping = aes(x = linpred, y = devres)) +
  labs(x = "Linear predictor", y = "Deviance residual")

There are potential outliers with fitted values smaller than -15, while the majority of values are near -5.

Q1.9

Plot the half-normal plot. Are there potential outliers in predictor space?

Answer:

halfnorm(hatvalues(mod))

There are potential outliers on row 931 and row 367.

Q1.10

Find the best sub-model using the AIC criterion.

Answer:

stats::step(mod, trace = TRUE, direction = "back") %>%
  tbl_regression() %>%
  bold_labels()
## Start:  AIC=3880.65
## cbind(deaths, confirmed - deaths) ~ (county + state + percent_fair_or_poor_health + 
##     percent_smokers + percent_adults_with_obesity + percent_with_access_to_exercise_opportunities + 
##     percent_excessive_drinking + percent_uninsured + percent_some_college + 
##     percent_unemployed + percent_children_in_poverty + percent_single_parent_households + 
##     percent_severe_housing_problems + overcrowding + percent_adults_with_diabetes + 
##     percent_food_insecure + percent_insufficient_sleep + percent_uninsured_2 + 
##     median_household_income + average_traffic_volume_per_meter_of_major_roadways + 
##     percent_homeowners + population_2 + percent_less_than_18_years_of_age + 
##     percent_65_and_over + percent_black + percent_asian + percent_hispanic + 
##     percent_female + percent_rural) - county
## 
##                                                      Df Deviance    AIC
## - percent_asian                                       1   1842.8 3878.7
## - percent_with_access_to_exercise_opportunities       1   1842.8 3878.7
## - average_traffic_volume_per_meter_of_major_roadways  1   1842.9 3878.8
## - percent_rural                                       1   1843.0 3878.9
## - percent_adults_with_obesity                         1   1843.2 3879.1
## - overcrowding                                        1   1843.3 3879.2
## - percent_65_and_over                                 1   1843.4 3879.3
## - percent_excessive_drinking                          1   1844.1 3880.0
## - percent_unemployed                                  1   1844.2 3880.1
## <none>                                                    1842.8 3880.6
## - percent_black                                       1   1845.0 3880.9
## - percent_adults_with_diabetes                        1   1845.2 3881.1
## - percent_uninsured_2                                 1   1845.6 3881.5
## - percent_female                                      1   1846.0 3881.9
## - percent_food_insecure                               1   1846.8 3882.7
## - population_2                                        1   1847.2 3883.1
## - percent_uninsured                                   1   1847.3 3883.2
## - percent_hispanic                                    1   1848.0 3883.9
## - percent_single_parent_households                    1   1848.7 3884.6
## - percent_smokers                                     1   1848.8 3884.7
## - percent_less_than_18_years_of_age                   1   1851.5 3887.4
## - percent_fair_or_poor_health                         1   1851.5 3887.4
## - percent_some_college                                1   1852.0 3887.9
## - median_household_income                             1   1853.5 3889.3
## - percent_children_in_poverty                         1   1858.4 3894.3
## - percent_severe_housing_problems                     1   1859.9 3895.8
## - percent_insufficient_sleep                          1   1862.8 3898.6
## - percent_homeowners                                  1   1873.7 3909.6
## - state                                              50   2413.8 4351.7
## 
## Step:  AIC=3878.66
## cbind(deaths, confirmed - deaths) ~ state + percent_fair_or_poor_health + 
##     percent_smokers + percent_adults_with_obesity + percent_with_access_to_exercise_opportunities + 
##     percent_excessive_drinking + percent_uninsured + percent_some_college + 
##     percent_unemployed + percent_children_in_poverty + percent_single_parent_households + 
##     percent_severe_housing_problems + overcrowding + percent_adults_with_diabetes + 
##     percent_food_insecure + percent_insufficient_sleep + percent_uninsured_2 + 
##     median_household_income + average_traffic_volume_per_meter_of_major_roadways + 
##     percent_homeowners + population_2 + percent_less_than_18_years_of_age + 
##     percent_65_and_over + percent_black + percent_hispanic + 
##     percent_female + percent_rural
## 
##                                                      Df Deviance    AIC
## - percent_with_access_to_exercise_opportunities       1   1842.8 3876.7
## - average_traffic_volume_per_meter_of_major_roadways  1   1842.9 3876.8
## - percent_rural                                       1   1843.1 3877.0
## - percent_adults_with_obesity                         1   1843.2 3877.1
## - overcrowding                                        1   1843.4 3877.2
## - percent_65_and_over                                 1   1843.5 3877.4
## - percent_excessive_drinking                          1   1844.3 3878.2
## - percent_unemployed                                  1   1844.3 3878.2
## <none>                                                    1842.8 3878.7
## - percent_black                                       1   1845.1 3879.0
## - percent_adults_with_diabetes                        1   1845.3 3879.2
## - percent_uninsured_2                                 1   1845.6 3879.5
## - percent_female                                      1   1846.1 3880.0
## - percent_food_insecure                               1   1846.9 3880.8
## - population_2                                        1   1847.2 3881.1
## - percent_uninsured                                   1   1847.4 3881.3
## - percent_hispanic                                    1   1848.0 3881.9
## - percent_single_parent_households                    1   1848.7 3882.6
## - percent_smokers                                     1   1848.8 3882.7
## - percent_fair_or_poor_health                         1   1851.6 3885.5
## - percent_less_than_18_years_of_age                   1   1852.0 3885.9
## - percent_some_college                                1   1852.3 3886.2
## - median_household_income                             1   1855.6 3889.5
## - percent_children_in_poverty                         1   1859.0 3892.9
## - percent_severe_housing_problems                     1   1860.9 3894.8
## - percent_insufficient_sleep                          1   1863.0 3896.9
## - percent_homeowners                                  1   1873.8 3907.7
## - state                                              50   2436.5 4372.4
## 
## Step:  AIC=3876.67
## cbind(deaths, confirmed - deaths) ~ state + percent_fair_or_poor_health + 
##     percent_smokers + percent_adults_with_obesity + percent_excessive_drinking + 
##     percent_uninsured + percent_some_college + percent_unemployed + 
##     percent_children_in_poverty + percent_single_parent_households + 
##     percent_severe_housing_problems + overcrowding + percent_adults_with_diabetes + 
##     percent_food_insecure + percent_insufficient_sleep + percent_uninsured_2 + 
##     median_household_income + average_traffic_volume_per_meter_of_major_roadways + 
##     percent_homeowners + population_2 + percent_less_than_18_years_of_age + 
##     percent_65_and_over + percent_black + percent_hispanic + 
##     percent_female + percent_rural
## 
##                                                      Df Deviance    AIC
## - average_traffic_volume_per_meter_of_major_roadways  1   1842.9 3874.8
## - percent_adults_with_obesity                         1   1843.2 3875.1
## - percent_rural                                       1   1843.3 3875.2
## - overcrowding                                        1   1843.4 3875.3
## - percent_65_and_over                                 1   1843.5 3875.4
## - percent_excessive_drinking                          1   1844.3 3876.2
## - percent_unemployed                                  1   1844.3 3876.2
## <none>                                                    1842.8 3876.7
## - percent_black                                       1   1845.1 3877.0
## - percent_adults_with_diabetes                        1   1845.3 3877.2
## - percent_uninsured_2                                 1   1845.7 3877.6
## - percent_female                                      1   1846.1 3878.0
## - percent_food_insecure                               1   1847.0 3878.9
## - population_2                                        1   1847.2 3879.1
## - percent_uninsured                                   1   1847.5 3879.4
## - percent_hispanic                                    1   1848.1 3880.0
## - percent_single_parent_households                    1   1848.7 3880.6
## - percent_smokers                                     1   1848.8 3880.7
## - percent_fair_or_poor_health                         1   1851.6 3883.5
## - percent_less_than_18_years_of_age                   1   1852.1 3884.0
## - percent_some_college                                1   1852.4 3884.3
## - median_household_income                             1   1855.7 3887.6
## - percent_children_in_poverty                         1   1859.0 3890.9
## - percent_severe_housing_problems                     1   1861.7 3893.6
## - percent_insufficient_sleep                          1   1863.0 3894.9
## - percent_homeowners                                  1   1874.2 3906.1
## - state                                              50   2442.1 4376.0
## 
## Step:  AIC=3874.78
## cbind(deaths, confirmed - deaths) ~ state + percent_fair_or_poor_health + 
##     percent_smokers + percent_adults_with_obesity + percent_excessive_drinking + 
##     percent_uninsured + percent_some_college + percent_unemployed + 
##     percent_children_in_poverty + percent_single_parent_households + 
##     percent_severe_housing_problems + overcrowding + percent_adults_with_diabetes + 
##     percent_food_insecure + percent_insufficient_sleep + percent_uninsured_2 + 
##     median_household_income + percent_homeowners + population_2 + 
##     percent_less_than_18_years_of_age + percent_65_and_over + 
##     percent_black + percent_hispanic + percent_female + percent_rural
## 
##                                     Df Deviance    AIC
## - percent_rural                      1   1843.4 3873.3
## - percent_adults_with_obesity        1   1843.4 3873.3
## - overcrowding                       1   1843.6 3873.5
## - percent_65_and_over                1   1843.7 3873.6
## - percent_excessive_drinking         1   1844.3 3874.2
## - percent_unemployed                 1   1844.5 3874.4
## <none>                                   1842.9 3874.8
## - percent_black                      1   1845.3 3875.2
## - percent_adults_with_diabetes       1   1845.4 3875.3
## - percent_uninsured_2                1   1845.7 3875.6
## - percent_female                     1   1846.9 3876.8
## - percent_food_insecure              1   1847.0 3876.9
## - population_2                       1   1847.2 3877.1
## - percent_uninsured                  1   1847.5 3877.4
## - percent_hispanic                   1   1848.1 3878.0
## - percent_smokers                    1   1849.0 3878.8
## - percent_single_parent_households   1   1849.2 3879.1
## - percent_fair_or_poor_health        1   1851.6 3881.5
## - percent_less_than_18_years_of_age  1   1852.2 3882.1
## - percent_some_college               1   1853.1 3883.0
## - median_household_income            1   1856.3 3886.2
## - percent_children_in_poverty        1   1859.0 3888.9
## - percent_severe_housing_problems    1   1862.8 3892.7
## - percent_insufficient_sleep         1   1864.0 3893.9
## - percent_homeowners                 1   1879.0 3908.9
## - state                             50   2463.7 4395.6
## 
## Step:  AIC=3873.28
## cbind(deaths, confirmed - deaths) ~ state + percent_fair_or_poor_health + 
##     percent_smokers + percent_adults_with_obesity + percent_excessive_drinking + 
##     percent_uninsured + percent_some_college + percent_unemployed + 
##     percent_children_in_poverty + percent_single_parent_households + 
##     percent_severe_housing_problems + overcrowding + percent_adults_with_diabetes + 
##     percent_food_insecure + percent_insufficient_sleep + percent_uninsured_2 + 
##     median_household_income + percent_homeowners + population_2 + 
##     percent_less_than_18_years_of_age + percent_65_and_over + 
##     percent_black + percent_hispanic + percent_female
## 
##                                     Df Deviance    AIC
## - percent_adults_with_obesity        1   1843.9 3871.8
## - percent_65_and_over                1   1844.2 3872.1
## - overcrowding                       1   1844.5 3872.4
## - percent_unemployed                 1   1844.9 3872.8
## - percent_excessive_drinking         1   1845.0 3872.9
## <none>                                   1843.4 3873.3
## - percent_black                      1   1845.4 3873.3
## - percent_adults_with_diabetes       1   1845.7 3873.6
## - percent_uninsured_2                1   1846.0 3873.9
## - percent_female                     1   1847.0 3874.9
## - percent_food_insecure              1   1847.1 3875.0
## - percent_uninsured                  1   1847.7 3875.6
## - population_2                       1   1847.9 3875.8
## - percent_hispanic                   1   1848.2 3876.1
## - percent_smokers                    1   1849.3 3877.2
## - percent_single_parent_households   1   1850.5 3878.3
## - percent_fair_or_poor_health        1   1852.0 3879.9
## - percent_less_than_18_years_of_age  1   1854.3 3882.2
## - percent_some_college               1   1854.5 3882.4
## - median_household_income            1   1857.1 3885.0
## - percent_children_in_poverty        1   1859.0 3886.9
## - percent_insufficient_sleep         1   1864.1 3892.0
## - percent_severe_housing_problems    1   1866.6 3894.5
## - percent_homeowners                 1   1880.1 3908.0
## - state                             50   2484.7 4414.6
## 
## Step:  AIC=3871.83
## cbind(deaths, confirmed - deaths) ~ state + percent_fair_or_poor_health + 
##     percent_smokers + percent_excessive_drinking + percent_uninsured + 
##     percent_some_college + percent_unemployed + percent_children_in_poverty + 
##     percent_single_parent_households + percent_severe_housing_problems + 
##     overcrowding + percent_adults_with_diabetes + percent_food_insecure + 
##     percent_insufficient_sleep + percent_uninsured_2 + median_household_income + 
##     percent_homeowners + population_2 + percent_less_than_18_years_of_age + 
##     percent_65_and_over + percent_black + percent_hispanic + 
##     percent_female
## 
##                                     Df Deviance    AIC
## - overcrowding                       1   1845.1 3871.0
## - percent_65_and_over                1   1845.1 3871.0
## - percent_excessive_drinking         1   1845.3 3871.2
## - percent_unemployed                 1   1845.4 3871.3
## - percent_black                      1   1845.8 3871.7
## <none>                                   1843.9 3871.8
## - percent_uninsured_2                1   1846.5 3872.4
## - percent_adults_with_diabetes       1   1847.3 3873.2
## - percent_female                     1   1847.5 3873.4
## - percent_food_insecure              1   1847.5 3873.4
## - percent_uninsured                  1   1848.1 3874.0
## - population_2                       1   1848.1 3874.0
## - percent_hispanic                   1   1848.6 3874.5
## - percent_smokers                    1   1850.5 3876.3
## - percent_single_parent_households   1   1851.2 3877.1
## - percent_fair_or_poor_health        1   1852.7 3878.6
## - percent_some_college               1   1854.5 3880.4
## - percent_less_than_18_years_of_age  1   1855.0 3880.9
## - median_household_income            1   1857.2 3883.1
## - percent_children_in_poverty        1   1860.0 3885.9
## - percent_insufficient_sleep         1   1864.1 3890.0
## - percent_severe_housing_problems    1   1866.7 3892.5
## - percent_homeowners                 1   1881.8 3907.7
## - state                             50   2487.0 4414.9
## 
## Step:  AIC=3870.96
## cbind(deaths, confirmed - deaths) ~ state + percent_fair_or_poor_health + 
##     percent_smokers + percent_excessive_drinking + percent_uninsured + 
##     percent_some_college + percent_unemployed + percent_children_in_poverty + 
##     percent_single_parent_households + percent_severe_housing_problems + 
##     percent_adults_with_diabetes + percent_food_insecure + percent_insufficient_sleep + 
##     percent_uninsured_2 + median_household_income + percent_homeowners + 
##     population_2 + percent_less_than_18_years_of_age + percent_65_and_over + 
##     percent_black + percent_hispanic + percent_female
## 
##                                     Df Deviance    AIC
## - percent_65_and_over                1   1846.2 3870.1
## - percent_excessive_drinking         1   1846.4 3870.3
## - percent_black                      1   1846.9 3870.8
## <none>                                   1845.1 3871.0
## - percent_unemployed                 1   1847.2 3871.1
## - percent_uninsured_2                1   1847.9 3871.8
## - percent_female                     1   1848.1 3872.0
## - population_2                       1   1848.3 3872.2
## - percent_adults_with_diabetes       1   1848.6 3872.5
## - percent_uninsured                  1   1849.5 3873.4
## - percent_food_insecure              1   1849.6 3873.5
## - percent_hispanic                   1   1850.8 3874.7
## - percent_single_parent_households   1   1852.2 3876.1
## - percent_fair_or_poor_health        1   1852.8 3876.6
## - percent_smokers                    1   1853.0 3876.9
## - percent_less_than_18_years_of_age  1   1855.4 3879.3
## - percent_some_college               1   1857.4 3881.3
## - median_household_income            1   1857.7 3881.6
## - percent_children_in_poverty        1   1862.5 3886.4
## - percent_insufficient_sleep         1   1864.5 3888.4
## - percent_severe_housing_problems    1   1867.1 3891.0
## - percent_homeowners                 1   1899.1 3923.0
## - state                             50   2489.7 4415.6
## 
## Step:  AIC=3870.06
## cbind(deaths, confirmed - deaths) ~ state + percent_fair_or_poor_health + 
##     percent_smokers + percent_excessive_drinking + percent_uninsured + 
##     percent_some_college + percent_unemployed + percent_children_in_poverty + 
##     percent_single_parent_households + percent_severe_housing_problems + 
##     percent_adults_with_diabetes + percent_food_insecure + percent_insufficient_sleep + 
##     percent_uninsured_2 + median_household_income + percent_homeowners + 
##     population_2 + percent_less_than_18_years_of_age + percent_black + 
##     percent_hispanic + percent_female
## 
##                                     Df Deviance    AIC
## <none>                                   1846.2 3870.1
## - percent_excessive_drinking         1   1848.2 3870.1
## - percent_black                      1   1848.7 3870.6
## - percent_uninsured_2                1   1849.0 3870.9
## - percent_adults_with_diabetes       1   1849.5 3871.4
## - percent_unemployed                 1   1849.7 3871.6
## - percent_uninsured                  1   1850.5 3872.4
## - population_2                       1   1850.5 3872.4
## - percent_food_insecure              1   1852.7 3874.5
## - percent_female                     1   1853.1 3875.0
## - percent_fair_or_poor_health        1   1854.0 3875.9
## - percent_single_parent_households   1   1854.2 3876.1
## - percent_hispanic                   1   1854.3 3876.2
## - percent_smokers                    1   1855.0 3876.9
## - percent_some_college               1   1861.3 3883.2
## - median_household_income            1   1861.6 3883.5
## - percent_insufficient_sleep         1   1864.6 3886.5
## - percent_severe_housing_problems    1   1867.9 3889.8
## - percent_children_in_poverty        1   1870.8 3892.7
## - percent_less_than_18_years_of_age  1   1875.2 3897.1
## - percent_homeowners                 1   1900.0 3921.9
## - state                             50   2496.7 4420.6
Characteristic log(OR)1 95% CI1 p-value
state


    Alabama — —
    Alaska -0.90 -3.8, 0.70 0.4
    Arizona 0.69 0.04, 1.3 0.039
    Arkansas -1.1 -2.2, -0.27 0.020
    California 0.28 -0.29, 0.86 0.3
    Colorado 0.47 -0.02, 0.97 0.062
    Connecticut 0.03 -0.45, 0.52 >0.9
    Delaware -0.61 -1.3, 0.05 0.082
    District of Columbia -1.4 -2.4, -0.50 0.003
    Florida 0.30 -0.12, 0.73 0.2
    Georgia 0.20 -0.18, 0.59 0.3
    Hawaii -1.5 -3.0, -0.33 0.023
    Idaho -0.56 -1.4, 0.21 0.2
    Illinois 0.11 -0.29, 0.53 0.6
    Indiana 0.42 0.01, 0.85 0.049
    Iowa -0.42 -1.2, 0.28 0.3
    Kansas 0.39 -0.27, 1.0 0.2
    Kentucky 0.38 -0.13, 0.89 0.14
    Louisiana 0.10 -0.29, 0.51 0.6
    Maine -0.54 -1.5, 0.27 0.2
    Maryland -0.77 -1.3, -0.26 0.003
    Massachusetts -0.94 -1.4, -0.43 <0.001
    Michigan 0.25 -0.13, 0.64 0.2
    Minnesota -0.29 -0.94, 0.34 0.4
    Mississippi 0.24 -0.24, 0.71 0.3
    Missouri -0.33 -0.83, 0.16 0.2
    Montana -0.03 -1.0, 0.81 >0.9
    Nebraska -0.23 -1.5, 0.73 0.7
    Nevada 0.31 -0.20, 0.82 0.2
    New Hampshire -1.9 -3.3, -0.85 0.002
    New Jersey 0.01 -0.40, 0.43 >0.9
    New Mexico 0.85 -0.21, 1.9 0.11
    New York -0.89 -1.4, -0.42 <0.001
    North Carolina -0.63 -1.1, -0.15 0.010
    North Dakota -0.94 -2.8, 0.34 0.2
    Ohio -0.06 -0.49, 0.37 0.8
    Oklahoma 0.86 0.32, 1.4 0.002
    Oregon 0.05 -0.57, 0.64 0.9
    Pennsylvania -0.81 -1.2, -0.37 <0.001
    Rhode Island -2.2 -3.7, -1.1 <0.001
    South Carolina -0.53 -1.0, -0.05 0.031
    South Dakota -0.70 -2.5, 0.54 0.4
    Tennessee -0.56 -1.0, -0.12 0.013
    Texas 0.89 0.21, 1.6 0.011
    Utah -0.94 -1.9, -0.08 0.041
    Vermont -1.2 -2.3, -0.25 0.022
    Virginia -1.5 -2.2, -0.94 <0.001
    Washington 0.52 0.07, 0.98 0.025
    West Virginia -1.5 -3.4, -0.33 0.036
    Wisconsin -0.19 -0.72, 0.33 0.5
    Wyoming -14 -402, -478 >0.9
percent_fair_or_poor_health -0.05 -0.08, -0.01 0.005
percent_smokers -0.05 -0.08, -0.02 0.003
percent_excessive_drinking -0.02 -0.04, 0.01 0.2
percent_uninsured -0.18 -0.35, -0.01 0.039
percent_some_college -0.02 -0.03, -0.01 <0.001
percent_unemployed 0.08 0.00, 0.16 0.055
percent_children_in_poverty 0.03 0.02, 0.05 <0.001
percent_single_parent_households -0.02 -0.03, -0.01 0.004
percent_severe_housing_problems -0.04 -0.06, -0.02 <0.001
percent_adults_with_diabetes -0.02 -0.05, 0.00 0.070
percent_food_insecure -0.11 -0.19, -0.02 0.011
percent_insufficient_sleep 0.04 0.02, 0.06 <0.001
percent_uninsured_2 0.12 -0.02, 0.26 0.10
median_household_income 0.00 0.00, 0.00 <0.001
percent_homeowners -0.04 -0.05, -0.03 <0.001
population_2 0.00 0.00, 0.00 0.038
percent_less_than_18_years_of_age -0.06 -0.08, -0.04 <0.001
percent_black 0.01 0.00, 0.02 0.11
percent_hispanic -0.02 -0.04, -0.01 0.005
percent_female 0.06 0.01, 0.10 0.010
1 OR = Odds Ratio, CI = Confidence Interval

Q1.11

Find the best sub-model using the lasso with cross validation.

Answer:

county_data
## # A tibble: 1,446 × 31
##    county    confirmed deaths state       percent_fair_or_poor…¹ percent_smokers
##    <chr>         <dbl>  <dbl> <fct>                        <dbl>           <dbl>
##  1 Abbeville         6      0 South Caro…                   19.9            17.3
##  2 Acadia           65      2 Louisiana                     20.9            21.5
##  3 Accomack          8      0 Virginia                      20.1            18.3
##  4 Ada             360      3 Idaho                         11.5            12.0
##  5 Adair            10      0 Missouri                      21.4            20.5
##  6 Adair            14      0 Oklahoma                      28.5            27.7
##  7 Adams           294      9 Colorado                      16.6            16.3
##  8 Adams            16      0 Mississippi                   27.3            22.2
##  9 Adams             8      0 Nebraska                      15.8            14.6
## 10 Adams            21      0 Pennsylvan…                   15.3            16.2
## # ℹ 1,436 more rows
## # ℹ abbreviated name: ¹​percent_fair_or_poor_health
## # ℹ 25 more variables: percent_adults_with_obesity <dbl>,
## #   percent_with_access_to_exercise_opportunities <dbl>,
## #   percent_excessive_drinking <dbl>, percent_uninsured <dbl>,
## #   percent_some_college <dbl>, percent_unemployed <dbl>,
## #   percent_children_in_poverty <dbl>, …
x <- model.matrix(~ . +state -1 - county - confirmed - deaths, data = county_data)[, -1]


# Response vector for deaths
y_deaths <- county_data$deaths

# Response vector for non_deaths
y_non_deaths <- county_data$confirmed - county_data$deaths

y <- as.matrix(cbind(y_deaths, y_non_deaths))

# Fit the binomial regression model using Lasso
binomial_lasso_model <- glmnet::glmnet(x, y, family = "binomial", alpha = 1)
# Plotting the coefficient path
plot(binomial_lasso_model, xvar = "lambda", label = TRUE)

cv_binomial_lasso <- glmnet::cv.glmnet(x, y, family = "binomial", nfolds = 5, alpha = 1, type.measure = "auc")
plot(cv_binomial_lasso)

best_lambda <- cv_binomial_lasso$lambda.min
best_lambda
## [1] 0.0001182757
# Fit the final model using the best lambda
final_model <- glmnet::glmnet(x, y, family = "binomial", alpha = 1, lambda = best_lambda)
# Extract coefficients at the best lambda
coefficients <- coef(final_model, s = best_lambda)

# Convert to a regular matrix or data frame for easier viewing
coefficients_df <- as.data.frame(as.matrix(coefficients))

# Remove zero coefficients and intercept
non_zero_coefficients <- coefficients_df[coefficients_df$s1 != 0, , drop = FALSE]

# Print non-zero coefficients
print(non_zero_coefficients)
##                                                          s1
## (Intercept)                                    1.888631e+00
## stateAlaska                                    5.125982e-01
## stateArizona                                  -7.336277e-02
## stateArkansas                                  9.461641e-01
## stateCalifornia                               -2.829404e-01
## stateColorado                                 -3.739685e-01
## stateConnecticut                              -4.518015e-01
## stateDelaware                                  5.382125e-02
## stateDistrict of Columbia                      8.811146e-02
## stateFlorida                                   3.957473e-02
## stateGeorgia                                  -3.265988e-01
## stateHawaii                                    7.185170e-01
## stateIdaho                                     5.585327e-01
## stateIllinois                                 -2.012604e-01
## stateIndiana                                  -2.295655e-01
## stateIowa                                      2.881274e-01
## stateKansas                                   -5.782134e-02
## stateKentucky                                 -3.843457e-01
## stateLouisiana                                -3.563468e-01
## stateMaine                                     2.977910e-01
## stateMaryland                                  1.712455e-01
## stateMassachusetts                             2.453251e-01
## stateMichigan                                 -4.920318e-01
## stateMississippi                              -4.463749e-03
## stateMissouri                                  4.218670e-01
## stateNebraska                                  2.642358e-01
## stateNevada                                   -1.243997e-01
## stateNew Hampshire                             1.029825e+00
## stateNew Jersey                               -3.956107e-01
## stateNew York                                  2.657473e-01
## stateNorth Carolina                            4.924251e-01
## stateNorth Dakota                              3.711830e-01
## stateOklahoma                                 -4.033990e-01
## stateOregon                                   -9.114957e-02
## statePennsylvania                              5.295978e-01
## stateRhode Island                              1.184213e+00
## stateSouth Carolina                            1.792111e-01
## stateSouth Dakota                              4.275367e-01
## stateTennessee                                 2.780528e-01
## stateUtah                                      9.183816e-01
## stateVermont                                   5.167664e-01
## stateVirginia                                  1.024569e+00
## stateWashington                               -7.338440e-01
## stateWest Virginia                             7.793073e-01
## stateWyoming                                   1.598929e+00
## percent_fair_or_poor_health                    1.797679e-02
## percent_smokers                                1.859377e-02
## percent_with_access_to_exercise_opportunities  4.733432e-04
## percent_uninsured                              2.825085e-03
## percent_unemployed                            -9.570811e-03
## percent_children_in_poverty                   -8.011092e-03
## percent_single_parent_households               1.225111e-03
## percent_severe_housing_problems                4.587573e-02
## percent_food_insecure                         -4.936420e-04
## percent_insufficient_sleep                    -1.522557e-02
## median_household_income                        3.707579e-06
## percent_homeowners                             2.060750e-02
## percent_less_than_18_years_of_age              2.966221e-03
## percent_65_and_over                           -3.333649e-02
## percent_asian                                 -5.370531e-03
## percent_hispanic                               3.887693e-03
## percent_rural                                 -4.943494e-04

Q2. Odds ratios

Consider a \(2 \times 2\) contingency table from a prospective study in which people who were or were not exposed to some pollutant are followed up and, after several years, categorized according to the presense or absence of a disease. Following table shows the probabilities for each cell. The odds of disease for either exposure group is \(O_i = \pi_i / (1 - \pi_i)\), for \(i = 1,2\), and so the odds ratio is \[ \phi = \frac{O_1}{O_2} = \frac{\pi_1(1 - \pi_2)}{\pi_2 (1 - \pi_1)} \] is a measure of the relative likelihood of disease for the exposed and not exposed groups.

Diseased Not diseased
Exposed \(\pi_1\) \(1 - \pi_1\)
Not exposed \(\pi_2\) \(1 - \pi_2\)

Q2.1

For the simple logistic model \[ \pi_i = \frac{e^{\beta_i}}{1 + e^{\beta_i}}, \] show that if there is no difference between the exposed and not exposed groups (i.e., \(\beta_1 = \beta_2\)), then \(\phi = 1\).

Answer:

\(\pi_1 = \frac{e^{\beta_1}}{1 + e^{\beta_1}} = \frac{e^{\beta_2}}{1 + e^{\beta_2}} = \pi_2\)

\(\pi_1(1-\pi_2) = \pi_2(1-\pi_1)\) by above statement

\(\phi = \frac{\pi_1(1 - \pi_2)}{\pi_2 (1 - \pi_1)} = 1\)

Q2.2

Consider \(J\) \(2 \times 2\) tables, one for each level \(x_j\) of a factor, such as age group, with \(j=1,\ldots, J\). For the logistic model \[ \pi_{ij} = \frac{e^{\alpha_i + \beta_i x_j}}{1 + e^{\alpha_i + \beta_i x_j}}, \quad i = 1,2, \quad j= 1,\ldots, J. \] Show that \(\log \phi\) is constant over all tables if \(\beta_1 = \beta_2\).

Answer:

\(\pi_{1j}=\frac{e^{\alpha_1+\beta_1x_j}}{1+e^{\alpha_1+\beta_1x_j}}\)

\(\pi_{2j}=\frac{e^{\alpha_2+\beta_2x_j}}{1+e^{\alpha_2+\beta_2x_j}}\)

\[\begin{align*} \log \phi &= \log \frac{\pi_{1j}(1 - \pi_{2j})}{\pi_{2j} (1 - \pi_{1j})}\\ &=\log \pi_{1j}+\log(1-\pi_{2j})-\log \pi_{2j}-\log (1-\pi_{1j})\\ &=\log \frac{e^{\alpha_1+\beta_1x_j}}{1+e^{\alpha_1+\beta_1x_j}}+\log(1-\frac{e^{\alpha_2+\beta_2x_j}}{1+e^{\alpha_2+\beta_2x_j}})-\log \frac{e^{\alpha_2+\beta_2x_j}}{1+e^{\alpha_2+\beta_2x_j}}-\log (1-\frac{e^{\alpha_1+\beta_1x_j}}{1+e^{\alpha_1+\beta_1x_j}})\\ &=\alpha_1+\beta_1x_j-\log(1+e^{\alpha_1+\beta_1x_j})+log(\frac{1}{1+e^{\alpha_2+\beta_2x_j}})-\alpha_2-\beta_2x_j+\log(1+e^{\alpha_2+\beta_2x_j})-log(\frac{1}{1+e^{\alpha_1+\beta_1x_j}})\\ &=\alpha_1-\alpha_2+\beta_1x_j-\beta_2x_j\\ &=\alpha_1-\alpha_2 \end{align*}\]

Q3. ELMR Chapter 4 Excercise 3

**

(a) Answer:

library(datasets)
head(infert)
##   education age parity induced case spontaneous stratum pooled.stratum
## 1    0-5yrs  26      6       1    1           2       1              3
## 2    0-5yrs  42      1       1    1           0       2              1
## 3    0-5yrs  39      6       2    1           0       3              4
## 4    0-5yrs  34      4       2    1           0       4              2
## 5   6-11yrs  35      3       1    1           1       5             32
## 6   6-11yrs  36      4       2    1           1       6             36
group1 <- infert |>
  filter(case == 1)
group2 <- infert |>
  filter(case == 0)

group1$induced <- as.factor(group1$induced)
group1$spontaneous <- as.factor(group1$spontaneous)

group2$induced <- as.factor(group2$induced)
group2$spontaneous <- as.factor(group2$spontaneous)
table(group1$induced, group1$spontaneous)
##    
##      0  1  2
##   0  7 22 18
##   1 12  5  6
##   2  9  4  0
table(group2$induced, group2$spontaneous)
##    
##      0  1  2
##   0 60 25 11
##   1 33 11  1
##   2 20  4  0

We can see table 2 (control) has more counts when both induced and spontaneous are 0 than table 1 (cases). Table 2 also seems to contain more observations than table 1 in general, which indicate there are more control observations than cases.

(b) Answer:

mod <- glm(case ~ spontaneous + induced, data = infert, family = binomial)
summary(mod)
## 
## Call:
## glm(formula = case ~ spontaneous + induced, family = binomial, 
##     data = infert)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.7079     0.2677  -6.380 1.78e-10 ***
## spontaneous   1.1972     0.2116   5.657 1.54e-08 ***
## induced       0.4181     0.2056   2.033    0.042 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 316.17  on 247  degrees of freedom
## Residual deviance: 279.61  on 245  degrees of freedom
## AIC: 285.61
## 
## Number of Fisher Scoring iterations: 4

Both spontaneous and induced are significant predictors as their p values are smaller than 0.05.

spontaneous For one unit increases in spontaneous, the odds of case increases by 231 percent, adjusting for induced.

induced For one unit increases in induced, the odds of case increases by 52 percent, adjusting for spontaneous.

(c) Answer:

mod2 <- glm(case ~ education + age + parity, data = infert, family = binomial)
summary(mod2)
## 
## Call:
## glm(formula = case ~ education + age + parity, family = binomial, 
##     data = infert)
## 
## Coefficients:
##                   Estimate Std. Error z value Pr(>|z|)
## (Intercept)      -0.847542   1.245895  -0.680    0.496
## education6-11yrs  0.046079   0.693754   0.066    0.947
## education12+ yrs  0.069988   0.718125   0.097    0.922
## age               0.002076   0.027245   0.076    0.939
## parity            0.019070   0.117221   0.163    0.871
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 316.17  on 247  degrees of freedom
## Residual deviance: 316.14  on 243  degrees of freedom
## AIC: 326.14
## 
## Number of Fisher Scoring iterations: 4

eudcation is a categorical variable so the model uses education0-5yrs as the reference group. According to the p values, education0-5yrs, education6-11yrs, education12+ yrs, age, and parity are not significant predictors. In specific, we do not have enough evidences to conclude that these predictors are significant in predicting case.

(d) Answer:

modall <- glm(case ~ education + age + parity + spontaneous + induced, data = infert, family = binomial)
summary(modall)
## 
## Call:
## glm(formula = case ~ education + age + parity + spontaneous + 
##     induced, family = binomial, data = infert)
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -1.14924    1.41220  -0.814   0.4158    
## education6-11yrs -1.04424    0.79255  -1.318   0.1876    
## education12+ yrs -1.40321    0.83416  -1.682   0.0925 .  
## age               0.03958    0.03120   1.269   0.2046    
## parity           -0.82828    0.19649  -4.215 2.49e-05 ***
## spontaneous       2.04591    0.31016   6.596 4.21e-11 ***
## induced           1.28876    0.30146   4.275 1.91e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 316.17  on 247  degrees of freedom
## Residual deviance: 257.80  on 241  degrees of freedom
## AIC: 271.8
## 
## Number of Fisher Scoring iterations: 4

education6-11yrs - When the observation has education 6-11 years, the odds of case is 64.2 percent lower than observation with education 0-5 years, adjusting for other predictors.

education12+ yrs - When the observation has education 12+ years, the odds of case is 75.4 percent lower than observation with education 0-5 years, adjusting for other predictors.

age - For one unit increases in age, the odds of case increases by 4 percent, adjusting for other predictors.

parity - For one unit increases in parity, the odds of case decreases by 56.3 percent, adjusting for other predictors.

spontaneous - For one unit increases in spontaneous, the odds of case increases by 674 percent, adjusting for other predictors.

induced - For one unit increases in induced, the odds of case increases by 263 percent, adjusting for other predictors.

(e) Answer:

cmod <- clogit(case ~ education + age + parity + spontaneous + induced + 
                 strata(stratum), data = infert)
summary(cmod)
## Call:
## coxph(formula = Surv(rep(1, 248L), case) ~ education + age + 
##     parity + spontaneous + induced + strata(stratum), data = infert, 
##     method = "exact")
## 
##   n= 248, number of events= 83 
## 
##                    coef exp(coef) se(coef)     z Pr(>|z|)    
## education6-11yrs     NA        NA   0.0000    NA       NA    
## education12+ yrs     NA        NA   0.0000    NA       NA    
## age                  NA        NA   0.0000    NA       NA    
## parity               NA        NA   0.0000    NA       NA    
## spontaneous      1.9859    7.2854   0.3524 5.635 1.75e-08 ***
## induced          1.4090    4.0919   0.3607 3.906 9.38e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##                  exp(coef) exp(-coef) lower .95 upper .95
## education6-11yrs        NA         NA        NA        NA
## education12+ yrs        NA         NA        NA        NA
## age                     NA         NA        NA        NA
## parity                  NA         NA        NA        NA
## spontaneous          7.285     0.1373     3.651    14.536
## induced              4.092     0.2444     2.018     8.298
## 
## Concordance= 0.776  (se = 0.044 )
## Likelihood ratio test= 53.15  on 2 df,   p=3e-12
## Wald test            = 31.84  on 2 df,   p=1e-07
## Score (logrank) test = 48.44  on 2 df,   p=3e-11

The output is interesting. There is no intercept for the model. Since education, age, and parity are confounders we try to control, they have NA for all their associated values. The output also shows exponential coefficients and their SEs and CIs.

For comparison, the output effect sizes for spontaneous and induced are similar to adjusted model modall in part (d) but very different with unadjusted model mod in part (b). Here is detailed interpretation in term of odds.

spontanuous - For one unit increases in spontaneous, the odds of case increase by 628 percent, adjusting for other predictors. induced - For one unit increases in induced, the odds of case increase by 309 percent, adjusting for other predictors.

The odds given here are relatively close to the odds in part (d).

(f) Answer:

infert <- infert |>
  mutate(spontaneous = factor(spontaneous, ordered = TRUE, levels = c("0", "1", "2")),
         induced = factor(induced, ordered = TRUE, levels = c("0", "1", "2")))
mod2 <- glm(case ~ spontaneous + induced, data = infert, family = binomial)
summary(mod2)
## 
## Call:
## glm(formula = case ~ spontaneous + induced, family = binomial, 
##     data = infert)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -0.10127    0.18998  -0.533   0.5940    
## spontaneous.L  1.66438    0.31104   5.351 8.75e-08 ***
## spontaneous.Q -0.09177    0.26012  -0.353   0.7242    
## induced.L      0.58336    0.30551   1.909   0.0562 .  
## induced.Q     -0.03947    0.28133  -0.140   0.8884    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 316.17  on 247  degrees of freedom
## Residual deviance: 279.47  on 243  degrees of freedom
## AIC: 289.47
## 
## Number of Fisher Scoring iterations: 4
# For comparison purpose, here is without transforming them to nominal factors.
summary(mod)
## 
## Call:
## glm(formula = case ~ spontaneous + induced, family = binomial, 
##     data = infert)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.7079     0.2677  -6.380 1.78e-10 ***
## spontaneous   1.1972     0.2116   5.657 1.54e-08 ***
## induced       0.4181     0.2056   2.033    0.042 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 316.17  on 247  degrees of freedom
## Residual deviance: 279.61  on 245  degrees of freedom
## AIC: 285.61
## 
## Number of Fisher Scoring iterations: 4

By comparing these two models, we can see mod2(nominal factors) has larger AIC than mod1 which is an indicator that mod1 is a better model. Therefore, we should not transform spontaneous and induced to nominal factors.